A from-scratch NumPy implementation of Word2Vec (Skip-Gram with Negative Sampling) trained on the NLTK Gutenberg corpus.
```
word2vec/
├── src/
│   ├── __init__.py      # Public API exports
│   ├── vocabulary.py    # Vocabulary class (token <-> index, subsampling, neg-sampling probas)
│   ├── model.py         # Word2Vec class (SGNS training, embeddings, nearest neighbours)
│   ├── data.py          # Gutenberg corpus loader
│   └── logger.py        # Singleton logger
├── main.py              # Entry point
└── requirements.txt
```
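The `vocabulary.py` comment above mentions subsampling and negative-sampling probabilities. As a rough sketch of how those two distributions are typically computed for SGNS (the exact names and thresholds here are assumptions, not the repository's actual code):

```python
import numpy as np

# Hypothetical word counts after scanning the corpus (assumed example data).
counts = np.array([1000, 400, 50, 1], dtype=np.float64)

# Negative-sampling distribution: unigram frequencies raised to the 3/4
# power, then renormalised, which flattens the distribution so rare words
# are drawn more often than their raw frequency would allow.
neg_probs = counts ** 0.75
neg_probs /= neg_probs.sum()

# Subsampling: a word with relative frequency f(w) above a threshold t is
# kept with probability sqrt(t / f(w)); infrequent words are always kept.
t = 1e-3
freqs = counts / counts.sum()
keep_probs = np.minimum(1.0, np.sqrt(t / freqs))
```

With these example counts, the most frequent word gets a keep probability well below 1, while the rarest word is always kept.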
Requirements:

- Python 3.10+
- numpy
- nltk
Install dependencies:

```bash
pip install -r requirements.txt
```

Run training:

```bash
python main.py
```

To limit the number of sentences loaded (useful for quick experiments):

```bash
python main.py --n-samples 5000
```

This will:
- Download the NLTK Gutenberg corpus (first run only).
- Build a vocabulary (words with `min_freq >= 5`).
- Train a Skip-Gram model with Negative Sampling for 5 epochs.
- Print the nearest neighbours for a set of query words.
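The core of the training step above is the SGNS update: the true context word is pushed towards the centre word's embedding, and a few sampled negatives are pushed away. A minimal NumPy sketch of one such update (shapes, names, and the learning rate are illustrative assumptions, not the repository's `model.py`):

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed shapes: vocabulary of 1000 words, 50-dimensional embeddings.
V, D = 1000, 50
W_in = rng.normal(0.0, 0.01, (V, D))  # input (centre-word) embeddings
W_out = np.zeros((V, D))              # output (context-word) embeddings

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(center, context, negatives, lr=0.025):
    """One SGNS update for a (centre, context) pair plus k negative samples."""
    v = W_in[center]
    idx = np.array([context, *negatives])
    labels = np.zeros(len(idx))
    labels[0] = 1.0                      # 1 for the true context, 0 for negatives
    scores = sigmoid(W_out[idx] @ v)     # predicted "is real context" probabilities
    errs = scores - labels               # gradient of the log-sigmoid loss
    grad_v = errs @ W_out[idx]           # accumulate before W_out changes
    W_out[idx] -= lr * errs[:, None] * v
    W_in[center] -= lr * grad_v

sgns_step(center=3, context=7, negatives=[42, 99, 512])
```

Repeating this step over all (centre, context) pairs in the sliding window, for every sentence and epoch, is what the training loop amounts to; `W_in` is then used as the final embedding matrix for the nearest-neighbour queries.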