Original PyTorch implementation: https://github.com/adjidieng/ETM
- Python 3.7
- TensorFlow 2.1.0
- Tokenize the text and generate a vocabulary file (vocab.txt; the first line must be "<PAD>")
- Encode the text as word ids (data.txt, one sentence per line; lines do not need to be padded to the same length)
- Train word2vec (embeddings.txt; you can use the code from the original implementation https://github.com/adjidieng/ETM). A preprocessing sketch follows this list.
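A minimal sketch covering the three steps above. It assumes gensim >= 4 is installed; the raw input file name (corpus.txt), the whitespace tokenization, and the embeddings.txt format (word followed by its vector) are assumptions for illustration, not fixed by this repo:

```python
from gensim.models import Word2Vec  # assumption: gensim >= 4 is available

# Naive whitespace tokenization of a hypothetical raw corpus file.
docs = [line.split() for line in open("corpus.txt", encoding="utf-8")]

# vocab.txt: one word per line; "<PAD>" must be the first line (id 0).
vocab = ["<PAD>"] + sorted({w for doc in docs for w in doc})
with open("vocab.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(vocab) + "\n")

# data.txt: one sentence of space-separated word ids per line, no padding.
word2id = {w: i for i, w in enumerate(vocab)}
with open("data.txt", "w", encoding="utf-8") as f:
    for doc in docs:
        f.write(" ".join(str(word2id[w]) for w in doc) + "\n")

# embeddings.txt: assumed format is "<word> <v1> <v2> ..." per line.
w2v = Word2Vec(docs, vector_size=300, min_count=1)
with open("embeddings.txt", "w", encoding="utf-8") as f:
    for w in vocab[1:]:
        f.write(w + " " + " ".join(f"{x:.6f}" for x in w2v.wv[w]) + "\n")
```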
To learn interpretable embeddings and topics:
```
python train.py --data_path data.txt --batch_size 512 --vocab_path vocab.txt --train_embeddings 1 --lr 0.0005 --epochs 1000
```
To learn interpretable topics using ETM with pre-fitted word embeddings:
```
python train.py --data_path data.txt --batch_size 512 --vocab_path vocab.txt --train_embeddings 0 --lr 0.0005 --epochs 1000 --emb_path embeddings.txt
```
- Uses sequences of word ids as model input, which is easier to preprocess but may cost more GPU memory (depending on the input length); a batching sketch follows this list
- Uses a DenseEncoder instead of the original BOW encoder (mathematically equivalent), since you may encounter NaN weights when using a sparse matrix; see the equivalence sketch after this list
- Metrics such as perplexity (ppl) are not implemented
- Supports predefined topic words, which may help to get reasonable topics; see the seeding sketch after this list
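Because the model takes variable-length id sequences, padding only needs to happen per batch. A minimal sketch using tf.data, assuming id 0 is "<PAD>" (it is, since "<PAD>" is the first line of vocab.txt):

```python
import tensorflow as tf

def line_ids():
    # One sentence of word ids per line of data.txt.
    with open("data.txt", encoding="utf-8") as f:
        for line in f:
            yield [int(t) for t in line.split()]

ds = tf.data.Dataset.from_generator(
    line_ids, output_types=tf.int32, output_shapes=[None])
# Pad each batch only up to its own longest sequence; 0 is the "<PAD>" id,
# so memory use depends on the longest sentence per batch, not the corpus.
ds = ds.padded_batch(512, padded_shapes=[None], padding_values=0)
```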
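The BOW/DenseEncoder equivalence comes down to a linear-algebra identity: applying a dense layer to bag-of-words counts gives the same result as summing per-word rows of the weight matrix over the id sequence. A small NumPy check:

```python
import numpy as np

vocab_size, hidden = 5, 3
rng = np.random.default_rng(0)
W = rng.standard_normal((vocab_size, hidden))  # first dense-layer weights

ids = [2, 2, 4, 1]                                  # one document as word ids
bow = np.bincount(ids, minlength=vocab_size).astype(float)

bow_out = bow @ W             # original encoder: dense layer on BOW counts
seq_out = W[ids].sum(axis=0)  # id-sequence encoder: sum of weight-row lookups
assert np.allclose(bow_out, seq_out)
# In practice the "<PAD>" id (0) must be masked so padding contributes nothing.
```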
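The exact mechanism behind predefined topic words isn't described here; one plausible approach (a sketch under that assumption, not this repo's actual code) is to initialize each topic embedding from its seed words' vectors:

```python
import numpy as np

# Hypothetical seed lists and word embeddings, for illustration only.
rng = np.random.default_rng(0)
emb = {w: rng.standard_normal(300) for w in ["game", "team", "bank", "loan"]}
seed_topics = [["game", "team"], ["bank", "loan"]]

# Initialize each topic embedding as the mean of its seed words' vectors.
topic_init = np.stack(
    [np.mean([emb[w] for w in words], axis=0) for words in seed_topics])
```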
```
@article{dieng2019topic,
  title={Topic modeling in embedding spaces},
  author={Dieng, Adji B and Ruiz, Francisco J R and Blei, David M},
  journal={arXiv preprint arXiv:1907.04907},
  year={2019}
}
```