StaniszewskiA/numpy_word2vec


word2vec

A from-scratch NumPy implementation of Word2Vec (Skip-Gram with Negative Sampling) trained on the NLTK Gutenberg corpus.

Project structure

word2vec/
├── src/
│   ├── __init__.py        # Public API exports
│   ├── vocabulary.py      # Vocabulary class (token <-> index, subsampling, neg-sampling probas)
│   ├── model.py           # Word2Vec class (SGNS training, embeddings, nearest neighbours)
│   ├── data.py            # Gutenberg corpus loader
│   └── logger.py          # Singleton logger
├── main.py                # Entry point
└── requirements.txt
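The vocabulary module's "subsampling, neg-sampling probas" refer to the two standard word2vec probability tables: the subsampling keep-probability for frequent words and the unigram distribution raised to the 0.75 power used to draw negative samples. A minimal NumPy sketch of those formulas (hypothetical function name, not necessarily the repo's exact code in src/vocabulary.py):

```python
import numpy as np
from collections import Counter

def build_tables(tokens, t=1e-5):
    """Standard word2vec probability tables (Mikolov et al., 2013):
    subsampling keep-probabilities and the negative-sampling distribution.
    Hypothetical sketch; the repo's Vocabulary class may differ."""
    counts = Counter(tokens)
    words = sorted(counts)
    freq = np.array([counts[w] for w in words], dtype=np.float64)
    p = freq / freq.sum()                      # relative word frequencies
    # Subsampling: keep_prob = sqrt(t/f) + t/f, clipped to 1
    keep = np.minimum(1.0, np.sqrt(t / p) + t / p)
    # Negative sampling draws from unigram counts raised to 0.75
    neg = freq ** 0.75
    neg /= neg.sum()
    return words, keep, neg
```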

Requirements

  • Python 3.10+
  • numpy
  • nltk

Install dependencies:

pip install -r requirements.txt

Usage

python main.py

To limit the number of sentences loaded (useful for quick experiments):

python main.py --n-samples 5000

This will:

  1. Download the NLTK Gutenberg corpus (first run only).
  2. Build a vocabulary, keeping words that occur at least min_freq = 5 times.
  3. Train a Skip-Gram model with Negative Sampling for 5 epochs.
  4. Print the nearest neighbours for a set of query words.
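Steps 3 and 4 follow the standard SGNS recipe: each update pushes the context vector toward the center word's embedding and the sampled negatives away, and neighbours are ranked by cosine similarity. A minimal sketch with hypothetical function names (the actual implementation lives in src/model.py and may differ):

```python
import numpy as np

def sgns_step(W_in, W_out, center, context, negatives, lr=0.025):
    """One SGNS update for a (center, context) pair plus k negatives.
    Assumes context and negative indices are distinct. Returns the loss."""
    v = W_in[center]                           # (d,) center embedding
    idx = np.concatenate(([context], negatives))
    u = W_out[idx]                             # (1+k, d) output vectors
    labels = np.zeros(len(idx)); labels[0] = 1.0
    scores = 1.0 / (1.0 + np.exp(-u @ v))      # sigmoid of dot products
    g = scores - labels                        # grad of log-loss w.r.t. scores
    W_in[center] -= lr * (g @ u)               # update center embedding
    W_out[idx] -= lr * np.outer(g, v)          # update context/negatives
    return -np.log(scores[0] + 1e-10) - np.log(1 - scores[1:] + 1e-10).sum()

def nearest_neighbours(W, query, k=3):
    """Indices of the k rows of W most cosine-similar to row `query`."""
    norms = np.linalg.norm(W, axis=1)
    sims = (W @ W[query]) / (norms * norms[query] + 1e-10)
    sims[query] = -np.inf                      # exclude the query word itself
    return np.argsort(-sims)[:k]
```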
