A simple implementation of the Skip-Gram model in PyTorch.
- main.py —— the training process
- model.py —— the model definition
- getData.py —— data pre-processing and organization (uses torch.utils.data.DataLoader for mini-batching; see the sketch below)
- text8, simtext2 —— the data files; "simtext2" is the smaller one
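For orientation, here is a minimal sketch of the batching and model ideas described above. The class names `SkipGramDataset` and `SkipGram` and the variable names are illustrative, not the repo's actual identifiers:

```python
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

class SkipGramDataset(Dataset):
    """Yields (center_word_id, context_word_id) training pairs."""
    def __init__(self, pairs):
        self.pairs = pairs  # list of (center, context) integer tuples built from the corpus

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        center, context = self.pairs[idx]
        return torch.tensor(center), torch.tensor(context)

class SkipGram(nn.Module):
    """Two embedding tables: one for center words, one for context words."""
    def __init__(self, vocab_size, embed_dim):
        super().__init__()
        self.in_embed = nn.Embedding(vocab_size, embed_dim)
        self.out_embed = nn.Embedding(vocab_size, embed_dim)

    def forward(self, center, context):
        # The score is the dot product between the center and context vectors.
        v = self.in_embed(center)    # (batch, embed_dim)
        u = self.out_embed(context)  # (batch, embed_dim)
        return (v * u).sum(dim=1)    # (batch,)

# DataLoader enables mini-batch training over the (center, context) pairs.
# pairs = [(0, 1), (1, 0), ...]  # built from the corpus during pre-processing
# loader = DataLoader(SkipGramDataset(pairs), batch_size=128, shuffle=True)
```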
If you encounter the warning "RuntimeWarning: divide by zero encountered in true_divide sampling_p = (np.sqrt(fre_np / 0.001) + 1) * 0.001 / fre_np", consider decreasing the value of vocabulary_size (for example to 1000), since you are probably using a smaller dataset.
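The warning appears when `fre_np` contains zero frequencies, i.e. when vocabulary_size exceeds the number of distinct tokens in a small corpus. A hedged sketch of the subsampling computation with a guard against zeros (the clipping is an illustrative workaround, not the repo's code; `fre_np` follows the name in the warning):

```python
import numpy as np

vocabulary_size = 1000  # reduce this for small corpora such as simtext2
t = 0.001               # subsampling threshold from the word2vec paper

# fre_np: relative frequency of each vocabulary word, shape (vocabulary_size,)
counts = np.random.randint(0, 100, size=vocabulary_size)  # stand-in for real corpus counts
fre_np = counts / counts.sum()

# Guard against zero frequencies so the division cannot produce a divide-by-zero warning.
fre_np = np.clip(fre_np, 1e-10, None)

# Keep-probability of each word under word2vec-style subsampling.
sampling_p = (np.sqrt(fre_np / t) + 1) * t / fre_np
```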
The results on the English text are as follows; the Chinese word vectors are still being trained.
task | this repo | CCL2017 paper |
---|---|---|
word relatedness | 69.88% | 69.36% |
syntactic question | 16.84% | 54.24% |
semantic question | 9.59% | 45.59% |