A simple implementation of the Skip-Gram model in PyTorch.
- main.py —— the training process
- model.py —— the model definition
- getData.py —— data pre-processing and organization (uses torch.utils.data.DataLoader for mini-batching; see the sketch below)
- text8, simtext2 —— the data files; "simtext2" is the smaller one
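For orientation, here is a minimal sketch of the batching and model ideas described above. The class names `SkipGramDataset` and `SkipGram` and the variable names are illustrative, not the repo's actual identifiers:

```python
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

class SkipGramDataset(Dataset):
    """Yields (center_word_id, context_word_id) training pairs."""
    def __init__(self, pairs):
        self.pairs = pairs  # list of (center, context) integer tuples built from the corpus

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        center, context = self.pairs[idx]
        return torch.tensor(center), torch.tensor(context)

class SkipGram(nn.Module):
    """Two embedding tables: one for center words, one for context words."""
    def __init__(self, vocab_size, embed_dim):
        super().__init__()
        self.in_embed = nn.Embedding(vocab_size, embed_dim)
        self.out_embed = nn.Embedding(vocab_size, embed_dim)

    def forward(self, center, context):
        # The score is the dot product between the center and context vectors.
        v = self.in_embed(center)    # (batch, embed_dim)
        u = self.out_embed(context)  # (batch, embed_dim)
        return (v * u).sum(dim=1)    # (batch,)

# DataLoader enables mini-batch training over the (center, context) pairs.
# pairs = [(0, 1), (1, 0), ...]  # built from the corpus during pre-processing
# loader = DataLoader(SkipGramDataset(pairs), batch_size=128, shuffle=True)
```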
If you encounter the warning "RuntimeWarning: divide by zero encountered in true_divide sampling_p = (np.sqrt(fre_np / 0.001) + 1) * 0.001 / fre_np", consider decreasing the value of vocabulary_size (for example to 1000), since you are probably using a smaller dataset.
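The warning appears when `fre_np` contains zero frequencies, i.e. when vocabulary_size exceeds the number of distinct tokens in a small corpus. A hedged sketch of the subsampling computation with a guard against zeros (the clipping is an illustrative workaround, not the repo's code; `fre_np` follows the name in the warning):

```python
import numpy as np

vocabulary_size = 1000  # reduce this for small corpora such as simtext2
t = 0.001               # subsampling threshold from the word2vec paper

# fre_np: relative frequency of each vocabulary word, shape (vocabulary_size,)
counts = np.random.randint(0, 100, size=vocabulary_size)  # stand-in for real corpus counts
fre_np = counts / counts.sum()

# Guard against zero frequencies so the division cannot produce a divide-by-zero warning.
fre_np = np.clip(fre_np, 1e-10, None)

# Keep-probability of each word under word2vec-style subsampling.
sampling_p = (np.sqrt(fre_np / t) + 1) * t / fre_np
```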
The results on the English text are as follows; the Chinese word vectors are still being trained.
task | this repo | CCL2017 paper |
---|---|---|
word relatedness | 69.88% | 69.36% |
syntactic question | 16.84% | 54.24% |
semantic question | 9.59% | 45.59% |