
Original PyTorch implementation: https://github.com/adjidieng/ETM

## Dependencies

- Python 3.7
- TensorFlow 2.1.0

## Preprocess Data

1. Tokenize the text and generate a vocabulary file (`vocab.txt`; the first line must be `<PAD>`).
2. Encode the text as word ids (`data.txt`, one sentence per line; lines do not need to be padded to the same length).
3. Train word2vec embeddings (`embeddings.txt`; you can use the code from the original implementation, https://github.com/adjidieng/ETM). A preprocessing sketch follows this list.
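Below is a minimal, hedged sketch covering all three steps. The input file name (`raw.txt`) and the use of gensim for word2vec are assumptions for illustration, and the exact embedding file format expected by `train.py` should be verified against the code:

```python
from collections import Counter

from gensim.models import Word2Vec

# Load a pre-tokenized corpus: one sentence per line, tokens separated by
# spaces. "raw.txt" is a placeholder name, not a file from this repo.
with open("raw.txt", encoding="utf-8") as f:
    corpus = [line.split() for line in f if line.strip()]

# Step 1: vocabulary file, with "<PAD>" on the first line so it maps to id 0.
counts = Counter(tok for sent in corpus for tok in sent)
vocab = ["<PAD>"] + [tok for tok, _ in counts.most_common()]
with open("vocab.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(vocab) + "\n")

# Step 2: encode each sentence as space-separated word ids; no padding needed.
word2id = {tok: i for i, tok in enumerate(vocab)}
with open("data.txt", "w", encoding="utf-8") as f:
    for sent in corpus:
        f.write(" ".join(str(word2id[tok]) for tok in sent) + "\n")

# Step 3: train skip-gram word2vec and write one "word v1 v2 ..." row per
# line (gensim >= 4; older versions use size= instead of vector_size=).
w2v = Word2Vec(sentences=corpus, vector_size=300, sg=1, min_count=1)
with open("embeddings.txt", "w", encoding="utf-8") as f:
    for tok in vocab[1:]:  # skip <PAD>
        f.write(tok + " " + " ".join(f"{x:.6f}" for x in w2v.wv[tok]) + "\n")
```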

## To Run

To learn interpretable embeddings and topics:

```
python train.py --data_path data.txt --batch_size 512 --vocab_path vocab.txt --train_embeddings 1 --lr 0.0005 --epochs 1000
```

To learn interpretable topics using ETM with pre-fitted word embeddings:

```
python train.py --data_path data.txt --batch_size 512 --vocab_path vocab.txt --train_embeddings 0 --lr 0.0005 --epochs 1000 --emb_path embeddings.txt
```

## Some Changes

1. The model takes sequences of word ids as input, which is easier to preprocess but may use more GPU memory (depending on the length of the input).

2. A dense (embedding-lookup) encoder replaces the original BOW encoder; the two are mathematically equivalent, but the sparse-matrix version can produce NaN weights (see the sketch after this list).

3. Metrics such as perplexity (ppl) are not implemented.

4. Topic words can be predefined, which may help to obtain reasonable topics.
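Item 2 above relies on the identity that multiplying a bag-of-words count vector by a weight matrix gives the same activations as summing the embedding-lookup rows of that matrix over the word-id sequence. A small sketch checking this identity (toy shapes and names, not this repo's actual encoder):

```python
import numpy as np
import tensorflow as tf

V, H = 1000, 64                    # toy vocab and hidden sizes
ids = tf.constant([[5, 9, 9, 2]])  # one sentence of word ids (no <PAD>)
W = tf.random.normal([V, H])       # shared encoder weight matrix

# BOW encoder: build a count vector over the vocabulary, then multiply
# it by the weight matrix.
bow = tf.reduce_sum(tf.one_hot(ids, depth=V), axis=1)  # (1, V) counts
h_bow = tf.matmul(bow, W)                              # (1, H)

# Dense/embedding encoder: look up the same rows of W and sum them.
h_dense = tf.reduce_sum(tf.nn.embedding_lookup(W, ids), axis=1)

# The two agree up to float error; in the real model, <PAD> positions
# would additionally be masked out of the sum.
print(np.allclose(h_bow.numpy(), h_dense.numpy(), atol=1e-4))  # True
```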

## Citation

```
@article{dieng2019topic,
  title={Topic modeling in embedding spaces},
  author={Dieng, Adji B and Ruiz, Francisco J R and Blei, David M},
  journal={arXiv preprint arXiv:1907.04907},
  year={2019}
}
```