Small NumPy-only trainer: builds a vocabulary from plain text, samples Skip-gram (center, context) pairs with a random window, and learns separate input/output embedding matrices with negative sampling.
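The pair sampling and negative-sampling distribution described above can be sketched as follows. This is a minimal illustration of the technique, not the repository's actual code; the function names `skipgram_pairs` and `noise_distribution` are assumptions:

```python
import numpy as np

def skipgram_pairs(tokens, max_window=5, rng=None):
    """Collect (center, context) token pairs using a per-center random window."""
    rng = rng if rng is not None else np.random.default_rng(0)
    pairs = []
    for i, center in enumerate(tokens):
        # Shrink the window at random per center word, as in the original word2vec
        w = int(rng.integers(1, max_window + 1))
        for j in range(max(0, i - w), min(len(tokens), i + w + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def noise_distribution(counts, power=0.75):
    """Unigram counts raised to 0.75 and normalized: the usual negative-sampling table."""
    p = np.asarray(counts, dtype=float) ** power
    return p / p.sum()
```

Raising counts to the 0.75 power flattens the unigram distribution so rare words are drawn as negatives more often than their raw frequency would allow.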
- Python 3.10+
- numpy
Point training at a UTF-8 text file. The script defaults to `text8.txt` (e.g. the text8 corpus). Place that file in the project directory or pass another path via `run_training(text_path=...)`.
```
python skipgram_neg_sampleing.py
```

`main()` calls `run_training()` with a small `max_chars` and more epochs for a quick demo. For full control (window size, batch size, negative samples, etc.), import `run_training` from `skipgram_neg_sampleing` or extend the script.
| File | Role |
|---|---|
| `dataset.py` | Tokenize text, vocab + counts, negative-sampling distribution, Skip-gram pairs |
| `skipgram_neg_sampleing.py` | Loss, gradients, SGD training loop |
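The loss/gradient/SGD step listed in the table can be illustrated with a minimal NumPy sketch. This is an assumption about the standard skip-gram negative-sampling update, not the file's exact code: for a center vector, one true context vector, and k negative vectors, the loss is -log σ(u_pos·v_c) - Σ log σ(-u_neg·v_c).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(v_c, U_pos, U_neg, lr=0.025):
    """One SGD step of skip-gram negative sampling for a single training pair.

    v_c:   (d,)   input embedding of the center word
    U_pos: (d,)   output embedding of the true context word
    U_neg: (k, d) output embeddings of k sampled negatives
    Returns updated copies of all three plus the loss value.
    """
    pos_score = sigmoid(U_pos @ v_c)   # should be pushed toward 1
    neg_score = sigmoid(U_neg @ v_c)   # should be pushed toward 0
    loss = -np.log(pos_score) - np.log(1.0 - neg_score).sum()

    # Gradients of the loss w.r.t. each parameter block
    g_pos = (pos_score - 1.0) * v_c                       # dL/dU_pos
    g_neg = neg_score[:, None] * v_c                      # dL/dU_neg
    g_c = (pos_score - 1.0) * U_pos + neg_score @ U_neg   # dL/dv_c

    return v_c - lr * g_c, U_pos - lr * g_pos, U_neg - lr * g_neg, loss
```

Note that the input matrix (holding `v_c`) and the output matrix (holding `U_pos`/`U_neg`) are separate, matching the two embedding matrices mentioned above.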
Trained input embeddings are the usual word vectors (`input_emb` returned from `run_training`).
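Once you have the input embedding matrix, nearest neighbors can be queried by cosine similarity. A minimal sketch; the `nearest` helper and the `word2id`/`id2word` lookup dicts are hypothetical, not part of the script:

```python
import numpy as np

def nearest(input_emb, word2id, id2word, query, topn=5):
    """Return the topn words closest to `query` by cosine similarity."""
    # L2-normalize rows so dot products equal cosine similarities
    E = input_emb / np.linalg.norm(input_emb, axis=1, keepdims=True)
    sims = E @ E[word2id[query]]
    best = np.argsort(-sims)
    return [id2word[i] for i in best if i != word2id[query]][:topn]
```

Because only the input matrix is used as the final word vectors, the output matrix can be discarded after training.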