Small NumPy-based Word2Vec implementation using skip-gram with negative sampling.
word2vec/
|-- main.py
|-- requirements.txt
|-- README.md
|-- word2vec/
| |-- __init__.py
| |-- corpus.py
| |-- vocab.py
| |-- data.py
| |-- model.py
| |-- trainer.py
| `-- evaluate.py
`-- tests/
    `-- test_word2vec.py
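Create and activate a virtual environment, then install dependencies (the activation command shown is for Windows PowerShell; on macOS or Linux use source .venv/bin/activate instead):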
python -m venv .venv
.venv\Scripts\Activate.ps1
pip install -r requirements.txt
Train the demo model from the project root:
python main.py
The script will:
- Build a vocabulary from the sample corpus.
- Generate skip-gram pairs.
- Train embeddings with negative sampling.
- Print nearest neighbours for a few probe words.
- Save weights to word2vec_weights.npz.
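The pair-generation and training steps listed above can be sketched roughly as follows. This is a minimal illustration, not the code in this repository: the function names `skipgram_pairs` and `sgns_step`, the window size, the learning rate, and the uniform sampling of negatives in the demo are all assumptions made for the example.

```python
import numpy as np

def skipgram_pairs(token_ids, window=2):
    """Return (center, context) index pairs within a symmetric window."""
    pairs = []
    for i, center in enumerate(token_ids):
        lo = max(0, i - window)
        hi = min(len(token_ids), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, token_ids[j]))
    return pairs

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(W_in, W_out, center, context, negatives, lr=0.05):
    """One SGD update for a single (center, context) pair; returns the loss
    measured before the update (hypothetical helper, not the project's API)."""
    v = W_in[center].copy()                       # centre-word vector
    ids = np.concatenate(([context], negatives)).astype(np.int64)
    labels = np.zeros(len(ids))
    labels[0] = 1.0                               # 1 for the true context, 0 for negatives
    u = W_out[ids]                                # fancy indexing returns a copy
    scores = sigmoid(u @ v)
    eps = 1e-10                                   # guards against log(0)
    loss = -np.log(scores[0] + eps) - np.sum(np.log(1.0 - scores[1:] + eps))
    err = scores - labels                         # d(loss)/d(dot product)
    # subtract.at accumulates correctly if an index repeats among the negatives
    np.subtract.at(W_out, ids, lr * np.outer(err, v))
    W_in[center] -= lr * (err @ u)
    return loss

# Toy demo: 5-word vocabulary, 8-dimensional embeddings.
rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.1, size=(5, 8))
W_out = np.zeros((5, 8))
for center, context in skipgram_pairs([0, 1, 2, 3, 4, 1, 0], window=2):
    # Uniform negatives for brevity; they may collide with the true context.
    negatives = rng.integers(0, 5, size=3)
    sgns_step(W_in, W_out, center, context, negatives)
```

Keeping separate input and output matrices (`W_in`, `W_out`) follows the usual SGNS formulation; whether the project stores its parameters under these names is not stated here.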
Run the full test suite with:
python -m pytest -q
Current status: 39 tests passing.
- word2vec/corpus.py: sample corpus text and tokeniser.
- word2vec/vocab.py: vocabulary lookup tables and noise distribution.
- word2vec/data.py: skip-gram pair generation helpers.
- word2vec/model.py: sigmoid, Word2Vec parameters, SGNS update step.
- word2vec/trainer.py: training loop and hyperparameter config.
- word2vec/evaluate.py: cosine similarity and nearest-neighbour utilities.
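To make two of the module roles above more concrete, here is a hedged sketch of a noise distribution for negative sampling and a cosine-similarity neighbour lookup. The names `noise_distribution` and `nearest_neighbours` are hypothetical, and the 0.75 exponent is the conventional word2vec choice, assumed here rather than taken from this project's vocab.py.

```python
import numpy as np

def noise_distribution(counts, power=0.75):
    """Unigram counts raised to `power`, normalised to sum to 1.
    The 0.75 exponent is an assumption (the common word2vec default)."""
    weights = np.asarray(counts, dtype=np.float64) ** power
    return weights / weights.sum()

def nearest_neighbours(embeddings, query_idx, k=3):
    """Return the k rows most cosine-similar to row `query_idx`,
    as (index, similarity) tuples, excluding the query itself."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.maximum(norms, 1e-12)  # avoid division by zero
    sims = unit @ unit[query_idx]
    sims[query_idx] = -np.inf                     # never return the query word
    top = np.argsort(-sims)[:k]
    return [(int(i), float(sims[i])) for i in top]

# Drawing negative samples from the noise distribution.
rng = np.random.default_rng(0)
probs = noise_distribution([50, 20, 5, 1])
negatives = rng.choice(len(probs), size=5, p=probs)
```

Raising counts to a power below 1 flattens the distribution, so rare words are drawn as negatives somewhat more often than their raw frequency would suggest.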