- https://code.google.com/archive/p/word2vec/source/default/source
- Ported the word2vec C code to Java
- Put "Input.txt" in the folder containing the source code
The contents of Input.txt are as follows
There is one document per line
All documents must be preprocessed
- Preprocessing: each document should be split into words by morphological analysis
- In Eclipse, you must give arguments (Run - Run Configurations...)
- a = input.txt, b = output.txt, that is, the names of the input and output files
- However, I have already set them in the code (Lines 34 and 35)
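As an illustration of the argument handling described above, here is a minimal sketch that takes the input and output file names from the program arguments and falls back to fixed names when none are given. The class name `ArgsSketch`, the method `resolveFiles`, and the default names are assumptions made for this example; the actual code at Lines 34 and 35 may look different.

```java
// Hypothetical sketch: class and method names are illustrative, not taken from the project.
class ArgsSketch {
    // Assumed defaults, based on the file names mentioned in this README.
    static final String DEFAULT_INPUT = "Input.txt";
    static final String DEFAULT_OUTPUT = "output.txt";

    /** Returns {inputFile, outputFile}, using the defaults for any missing argument. */
    static String[] resolveFiles(String[] args) {
        String input = args.length > 0 ? args[0] : DEFAULT_INPUT;
        String output = args.length > 1 ? args[1] : DEFAULT_OUTPUT;
        return new String[] { input, output };
    }

    public static void main(String[] args) {
        String[] files = resolveFiles(args);
        System.out.println("input  = " + files[0]);
        System.out.println("output = " + files[1]);
    }
}
```

With this layout, leaving the Eclipse arguments empty still runs the program against the default file names.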
- Document 1 : KimJunho is interested in machine learning and deep learning
- Document 2 : KimJunho is interested in recruiting professional researchers
KimJunho interested machine learning deep learning
KimJunho recruiting professional researchers
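The preprocessing step can be sketched roughly as follows. This is only an illustration: a whitespace tokenizer with a small stop-word list stands in for a real morphological analyzer, and the class name `PreprocessSketch` and the stop-word list are assumptions made for this example. Which words are kept depends on the analyzer and the stop-word list you actually use.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: a whitespace tokenizer plus a tiny stop-word list stands in
// for a real morphological analyzer, which this example does not implement.
class PreprocessSketch {
    // Assumed stop words, chosen only for this example.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("is", "in", "and"));

    /** Splits a document on whitespace, drops stop words, and rejoins the tokens with single spaces. */
    static String preprocess(String document) {
        List<String> kept = new ArrayList<>();
        for (String token : document.trim().split("\\s+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token.toLowerCase())) {
                kept.add(token);
            }
        }
        return String.join(" ", kept);
    }

    public static void main(String[] args) {
        String[] documents = {
            "KimJunho is interested in machine learning and deep learning",
            "KimJunho is interested in recruiting professional researchers"
        };
        // One preprocessed document per line, the layout expected in Input.txt.
        for (String document : documents) {
            System.out.println(preprocess(document));
        }
    }
}
```

The printed lines can be saved as Input.txt, one document per line.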
- See Line 894 (public static class Builder)
1. cbow = false
Which model to train: CBOW or skip-gram
false : use the skip-gram model
true : use the CBOW model
2. startingAlpha = 0.025F
This is the initial learning rate
A smaller value tends to make learning more accurate, but training is slower
3. window = 5
How many surrounding words to look at when learning
The default value is 5, meaning that up to 5 words on each side of the target word are considered
4. negative = 0
Selects the method used to make the training computation more efficient
The available methods are Hierarchical Softmax and Negative Sampling
If 0, Hierarchical Softmax is used
Otherwise, Negative Sampling is used; typical values are 5 to 10
5. minCount = 5
Words that appear fewer than minCount times in the documents are ignored
If you want to learn every word, set minCount = 0
6. layerOneSize = 200
The dimension of the word vectors
The default value is 200
A higher dimension can represent words more precisely, but training is slower
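To make the parameters above more concrete, here is a minimal sketch of how three of them behave: `negative` choosing the method, `minCount` filtering the vocabulary, and `window` selecting context words. The class `ParamSketch` and its methods are hypothetical and are not the project's Builder API. The sketch also uses a fixed window, whereas the original C implementation picks a smaller effective window at random for each position.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: these names are illustrative and are not the project's Builder API.
class ParamSketch {
    // Values as listed in the parameter list above.
    boolean cbow = false;
    float startingAlpha = 0.025F;
    int window = 5;
    int negative = 0;
    int minCount = 5;
    int layerOneSize = 200;

    /** negative == 0 selects Hierarchical Softmax; any other value selects Negative Sampling. */
    String outputMethod() {
        return negative == 0 ? "Hierarchical Softmax" : "Negative Sampling";
    }

    /** Counts the words and keeps only those that occur at least minCount times. */
    static Map<String, Integer> filterVocabulary(List<String> words, int minCount) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String word : words) {
            counts.merge(word, 1, Integer::sum);
        }
        counts.values().removeIf(count -> count < minCount);
        return counts;
    }

    /** Returns up to `window` words on each side of the word at index `center`. */
    static List<String> context(List<String> words, int center, int window) {
        List<String> result = new ArrayList<>();
        int from = Math.max(0, center - window);
        int to = Math.min(words.size() - 1, center + window);
        for (int i = from; i <= to; i++) {
            if (i != center) {
                result.add(words.get(i));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        ParamSketch params = new ParamSketch();
        System.out.println("method = " + params.outputMethod());

        List<String> words =
            Arrays.asList("KimJunho interested machine learning deep learning".split(" "));
        // Only "learning" occurs twice, so it is the only word kept with minCount = 2.
        System.out.println(filterVocabulary(words, 2)); // prints {learning=2}
        // Context of "machine" (index 2) with window = 1.
        System.out.println(context(words, 2, 1)); // prints [interested, learning]
    }
}
```

On a small test corpus, minCount = 5 can remove almost every word, which is why setting it to 0 is useful when trying the code on a few short documents.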