RedHoven/word2vec

Word2Vec: Skip-Gram with Negative Sampling on Simple English Wikipedia dataset

An implementation of the Word2Vec algorithm using the Skip-gram model with Negative Sampling. The code trains word embeddings on a given text corpus and supports interactive exploration of similar words based on the learned embeddings. The dataset was taken from Simple English Wikipedia (Kaggle).
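In the skip-gram model, each word in the corpus is treated as a "center" word and is paired with the words inside a fixed-size context window around it. A minimal sketch of that pair generation (the function name `skipgram_pairs` and the window default are illustrative, not taken from this repository):

```python
def skipgram_pairs(tokens, window=2):
    """Yield (center, context) pairs from a tokenized sentence.

    For each position i, every token within `window` positions
    (excluding i itself) is emitted as a context word.
    """
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, tokens[j]
```

For example, `skipgram_pairs(["a", "b", "c"], window=1)` yields `("a", "b")`, `("b", "a")`, `("b", "c")`, `("c", "b")`.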

The trained embeddings and training logs are saved in the artifacts directory.

  • word2vec.py - main implementation of the Word2Vec algorithm, including data preprocessing, the training loop, and interactive exploration of similar words
  • math.md - mathematical derivation of the Skip-gram with Negative Sampling loss function and its gradients
  • result.ipynb - notebook with visualizations of the training loss and of the resulting embeddings via SVD
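The loss derived in math.md is the standard Skip-gram with Negative Sampling objective for one (center, context) pair with k sampled negatives: L = -log σ(vᶜ·uᵒ) - Σₖ log σ(-vᶜ·uₖ). A NumPy sketch of that loss and its gradients (function and variable names here are illustrative, not the repository's actual API):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_loss_and_grads(center_vec, context_vec, neg_vecs):
    """SGNS loss and gradients for one (center, context) pair.

    center_vec  : (d,)   input embedding of the center word
    context_vec : (d,)   output embedding of the true context word
    neg_vecs    : (k, d) output embeddings of k negative samples
    """
    pos_score = sigmoid(center_vec @ context_vec)   # sigma(v_c . u_o)
    neg_scores = sigmoid(-neg_vecs @ center_vec)    # sigma(-v_c . u_k)

    loss = -np.log(pos_score) - np.sum(np.log(neg_scores))

    # Gradients follow from d/dx log sigma(x) = 1 - sigma(x)
    # and the identity sigma(v . u_k) = 1 - sigma(-v . u_k).
    grad_center = (pos_score - 1.0) * context_vec + (1.0 - neg_scores) @ neg_vecs
    grad_context = (pos_score - 1.0) * center_vec
    grad_neg = np.outer(1.0 - neg_scores, center_vec)
    return loss, grad_center, grad_context, grad_neg
```

With all-zero vectors every score is σ(0) = 0.5, so the loss reduces to (1 + k)·log 2, a useful sanity check against the derivation in math.md.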

How to run?

  • download the dataset and artifacts (if needed) from Google Drive
  • install dependencies: pip install -r requirements.txt
  • run training from scratch: python word2vec.py --train
  • run interactive mode: python word2vec.py --interactive
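During training, negative samples are commonly drawn from the unigram distribution raised to the 3/4 power, as in the original word2vec; whether this repository uses exactly that exponent is an assumption. A minimal sketch of such a sampler (name `negative_sampler` is hypothetical):

```python
import numpy as np

def negative_sampler(counts, power=0.75, seed=0):
    """Build a sampler over word indices using the unigram^power
    noise distribution (power=0.75 is the classic word2vec choice).

    counts : list of raw word frequencies, indexed by word id
    """
    probs = np.asarray(counts, dtype=float) ** power
    probs /= probs.sum()
    rng = np.random.default_rng(seed)  # fixed seed for reproducibility

    def sample(k):
        # Draw k word ids with probability proportional to count^power
        return rng.choice(len(probs), size=k, p=probs)

    return sample
```

Raising counts to 0.75 flattens the distribution, so rare words are sampled more often than their raw frequency would suggest while frequent words still dominate.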
