Word2Vec: Skip-Gram with Negative Sampling on Simple English Wikipedia dataset

Implementation of the Word2Vec algorithm using the Skip-gram model with Negative Sampling. This code trains word embeddings on a given text corpus and allows for interactive exploration of similar words based on the learned embeddings. The dataset was taken from Simple English Wikipedia (Kaggle).
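To make the training objective concrete, here is a minimal NumPy sketch of one Skip-gram with Negative Sampling update for a single (center, context) pair. It is an illustration under stated assumptions, not the code in word2vec.py: the function name `sgns_step`, the two-matrix layout (`W_in`, `W_out`), and the default learning rate are all hypothetical.

```python
import numpy as np

def sgns_step(W_in, W_out, center, context, negatives, lr=0.025):
    """One SGD update for a (center, context) pair plus sampled negatives.

    W_in, W_out: (vocab_size, dim) float arrays, updated in place.
    Returns the loss evaluated before the update.
    Hypothetical helper; names and defaults are assumptions.
    """
    ids = np.concatenate(([context], np.asarray(negatives, dtype=np.int64)))
    v = W_in[center].copy()          # center vector, shape (dim,)
    U = W_out[ids]                   # context + negative vectors, shape (k+1, dim)
    scores = U @ v                   # dot products u_i . v

    # loss = -log sigmoid(u_o . v) - sum_k log sigmoid(-u_k . v),
    # written with logaddexp for numerical stability
    loss = np.logaddexp(0.0, -scores[0]) + np.sum(np.logaddexp(0.0, scores[1:]))

    labels = np.zeros(len(ids))
    labels[0] = 1.0                  # 1 for the true context, 0 for negatives
    err = 1.0 / (1.0 + np.exp(-scores)) - labels   # dLoss/dscore = sigmoid(score) - label

    W_in[center] -= lr * (err @ U)                    # gradient w.r.t. the center vector
    np.subtract.at(W_out, ids, lr * np.outer(err, v)) # accumulates if an index repeats
    return float(loss)
```

Calling this repeatedly on the same pair should drive the loss down, which is a quick sanity check before wiring it into a full training loop.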

The trained embeddings and training logs are saved in the artifacts directory.

word2vec.py - main implementation of the Word2Vec algorithm, including data preprocessing, training loop, and interactive exploration of similar words.
math.md - mathematical derivation of the Skip-gram with Negative Sampling loss function and its gradients.
result.ipynb - notebook containing visualizations of the loss and resulting embeddings using SVD.
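As a rough sketch of how an SVD-based visualization of embeddings can work, the snippet below projects an embedding matrix onto its top two singular directions. Whether result.ipynb centers the embeddings first or uses this exact projection is an assumption; `project_2d` is a hypothetical helper name.

```python
import numpy as np

def project_2d(embeddings):
    """Project (n_words, dim) embeddings to 2-D using a truncated SVD.

    Assumes mean-centering before the decomposition (PCA-style);
    the notebook may do this differently.
    """
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    # Thin SVD: centered = U @ diag(S) @ Vt, singular values sorted descending
    U, S, Vt = np.linalg.svd(centered, full_matrices=False)
    # Coordinates along the two strongest directions (same as centered @ Vt[:2].T)
    return U[:, :2] * S[:2]
```

The returned array has one (x, y) row per word and can be passed directly to a scatter plot.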

How to run?

download the dataset and artifacts (if needed) from Google Drive
install dependencies: pip install -r requirements.txt
run training from scratch: python word2vec.py --train
run interactive mode: python word2vec.py --interactive
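For orientation, looking up similar words typically comes down to ranking the vocabulary by cosine similarity to a query vector. The sketch below shows that idea only; the function `most_similar`, its arguments, and the assumption that cosine similarity is the metric used by the interactive mode are not taken from word2vec.py.

```python
import numpy as np

def most_similar(word, vocab, embeddings, topn=5):
    """Return up to `topn` (word, cosine similarity) pairs closest to `word`.

    vocab: list of words aligned with the rows of `embeddings`.
    Hypothetical helper; the real script's interface may differ.
    """
    index = {w: i for i, w in enumerate(vocab)}
    if word not in index:
        return []                                    # unknown word: nothing to rank
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.maximum(norms, 1e-12)     # guard against zero vectors
    sims = unit @ unit[index[word]]                  # cosine similarity to the query
    sims[index[word]] = -np.inf                      # exclude the query word itself
    best = np.argsort(-sims)[:topn]
    return [(vocab[i], float(sims[i])) for i in best]
```

With embeddings loaded from the artifacts directory and the matching vocabulary list, a call such as `most_similar("king", vocab, embeddings)` would return the nearest neighbours of that word.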
