MrNo001/word2vec

word2vec (Skip-gram + negative sampling)

A small, NumPy-only trainer: it builds a vocabulary from plain text, samples Skip-gram (center, context) pairs with a randomly shrunken window, and learns separate input and output embedding matrices with negative sampling.
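The random-window pair sampling can be sketched as follows. This is an illustrative sketch, not the repository's exact code; the function name `skipgram_pairs` and its signature are assumptions:

```python
import numpy as np

def skipgram_pairs(token_ids, max_window, rng):
    """Sample (center, context) pairs. As in word2vec, the effective
    window for each center word is drawn uniformly from 1..max_window,
    which implicitly weights nearby context words more heavily."""
    pairs = []
    for i, center in enumerate(token_ids):
        w = int(rng.integers(1, max_window + 1))  # random shrunken window
        lo, hi = max(0, i - w), min(len(token_ids), i + w + 1)
        for j in range(lo, hi):
            if j != i:  # skip the center position itself
                pairs.append((center, token_ids[j]))
    return pairs

pairs = skipgram_pairs([0, 1, 2, 3], max_window=2, rng=np.random.default_rng(0))
```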

Setup

  • Python 3.10+
  • numpy

Data

Point training at a UTF-8 text file. The script defaults to text8.txt (e.g. the text8 corpus). Place that file in the project directory or pass another path via run_training(text_path=...).

Run

python skipgram_neg_sampleing.py

main() calls run_training() with a small max_chars and a higher epoch count for a quick demo. For full control (window size, batch size, negative samples, etc.), import run_training from skipgram_neg_sampleing or extend the script.

Layout

| File | Role |
| --- | --- |
| `dataset.py` | Tokenization, vocabulary + counts, negative-sampling distribution, Skip-gram pair generation |
| `skipgram_neg_sampleing.py` | Loss, gradients, SGD training loop |
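The negative-sampling distribution built in dataset.py is presumably the standard word2vec choice: unigram counts raised to the 3/4 power. A minimal sketch (the function name and `power` parameter are illustrative):

```python
import numpy as np

def neg_sampling_dist(counts, power=0.75):
    """Unigram distribution raised to the 3/4 power (word2vec default):
    very frequent words are down-weighted, rare words up-weighted."""
    p = np.asarray(counts, dtype=np.float64) ** power
    return p / p.sum()

dist = neg_sampling_dist([100, 10, 1])
```

Negative word ids can then be drawn with `rng.choice(len(dist), size=k, p=dist)`.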

The trained input embeddings are the usual word vectors (the input_emb array returned by run_training).
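For reference, one SGD step on the negative-sampling objective, -log σ(u_ctx·v_c) − Σₖ log σ(−u_negₖ·v_c), can be sketched like this. It is a simplified sketch under assumed names (`sgns_step`, separate `input_emb`/`output_emb` arrays), not the script's exact training loop:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(input_emb, output_emb, center, context, negatives, lr=0.025):
    """One SGD update for a single (center, context) pair plus K negatives."""
    v = input_emb[center]                          # center vector, shape (d,)
    ids = np.concatenate(([context], negatives))   # positive word first
    u = output_emb[ids]                            # shape (K+1, d)
    labels = np.zeros(len(ids)); labels[0] = 1.0   # 1 for context, 0 for negatives
    scores = sigmoid(u @ v)
    loss = -np.log(scores[0]) - np.sum(np.log(1.0 - scores[1:]))
    g = scores - labels                            # dL/d(u_i . v) for each row
    input_emb[center] -= lr * (g @ u)              # gradient w.r.t. v
    output_emb[ids] -= lr * np.outer(g, v)         # gradient w.r.t. each u_i
    return loss
```

Updating both matrices in place keeps the step O(K·d) regardless of vocabulary size, which is the whole point of negative sampling over a full softmax.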

About

The word2vec algorithm implemented with Skip-gram and negative sampling.
