WeTe: Representing Mixtures of Word Embeddings with Mixtures of Topic Embeddings

This is the official implementation of our paper "Representing Mixtures of Word Embeddings with Mixtures of Topic Embeddings", published at ICLR 2022.


WeTe is a new topic modeling framework that views a document as the set of its word embeddings and views topics as a set of embedding vectors shared across all documents. The topic embeddings and the document-specific topic proportions are learned by minimizing the bidirectional transport cost between these two sets.
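As a rough illustration of this objective, here is a minimal sketch of a bidirectional transport cost between one document's word embeddings and the shared topic embeddings. The function name, the Euclidean cost, and the softmax-based transport plans are assumptions for illustration only; the repository's actual objective is implemented in main.py and Trainer.py.

```python
# A minimal sketch, not the repository's actual implementation.
import torch
import torch.nn.functional as F

def bidirectional_transport_cost(word_emb, topic_emb, doc_proportion):
    """
    word_emb:       (n_words, dim)  embeddings of the words in one document
    topic_emb:      (n_topics, dim) topic embeddings shared across documents
    doc_proportion: (n_topics,)     document-specific topic proportions (sums to 1)
    """
    # Pairwise cost between every word and every topic (Euclidean, as an assumption).
    cost = torch.cdist(word_emb, topic_emb)                      # (n_words, n_topics)

    # Word -> topic direction: each word is transported to topics, with weights
    # that favour nearby topics and the document's topic proportions.
    plan_w2t = F.softmax(-cost, dim=1) * doc_proportion          # (n_words, n_topics)
    plan_w2t = plan_w2t / plan_w2t.sum(dim=1, keepdim=True)
    cost_w2t = (plan_w2t * cost).sum(dim=1).mean()

    # Topic -> word direction: each topic is transported back to the document's words.
    plan_t2w = F.softmax(-cost.t(), dim=1)                       # (n_topics, n_words)
    cost_t2w = (doc_proportion * (plan_t2w * cost.t()).sum(dim=1)).sum()

    return cost_w2t + cost_t2w
```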

Getting Started

Install

  • Clone this repo:
git clone git@github.com:wds2014/WeTe.git
cd WeTe
  • Install PyTorch with CUDA support and any other requirements you need.

Dataset

  • Datasets in our paper

All datasets used in our paper can be downloaded from Google Drive.

  • Customising your own dataset

Organize the bag-of-words (BoW) matrix and the vocabulary of your corpus into the form WeTe expects, following the provided .pkl file in the dataset folder and dataloader.py (a hypothetical sketch is given below). Then you are ready to try WeTe!
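Below is a hypothetical sketch of packing a small corpus into a .pkl file containing a BoW matrix and a vocabulary list. The key names (`bow`, `vocab`) are assumptions; match them to the provided .pkl file and to what dataloader.py actually reads.

```python
# A hypothetical sketch; check the provided .pkl file and dataloader.py
# for the exact fields WeTe expects.
import pickle
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

docs = ["first toy document about topic models",
        "second toy document about word embeddings"]

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(docs)            # sparse (n_docs, vocab_size) BoW matrix
vocab = vectorizer.get_feature_names_out().tolist()

data = {
    "bow": bow.toarray().astype(np.float32),    # assumed key name
    "vocab": vocab,                             # assumed key name
}
with open("my_corpus.pkl", "wb") as f:
    pickle.dump(data, f)
```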

Pretrained word embeddings

We recommend loading pre-trained word embeddings for better results.

  • Glove

The pre-trained GloVe word embeddings can be downloaded from GloVe.

  • Alternatively, train (or fine-tune) word embeddings on your corpus with the word2vec tool; a sketch of both options follows.
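The sketch below illustrates both options: reading a GloVe .txt file into a vocabulary-aligned embedding matrix, and training word2vec on the tokenized corpus with gensim. File paths, the embedding dimension, and the alignment logic are assumptions; WeTe's own loading code may expect a different format.

```python
# A minimal sketch under assumed file formats; adapt to WeTe's loading code.
import numpy as np
from gensim.models import Word2Vec

def load_glove(path, vocab, dim=300):
    """Read a GloVe .txt file and return a (vocab_size, dim) matrix aligned to vocab."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            if parts[0] in vocab:
                vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    # Words missing from GloVe get small random vectors.
    emb = np.random.normal(scale=0.1, size=(len(vocab), dim)).astype(np.float32)
    for i, w in enumerate(vocab):
        if w in vectors:
            emb[i] = vectors[w]
    return emb

def train_word2vec(tokenized_docs, dim=300):
    """Train word2vec on the corpus itself with gensim."""
    model = Word2Vec(sentences=tokenized_docs, vector_size=dim, window=5,
                     min_count=1, workers=4)
    return model.wv
```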

Training

  • Easy to train:
python main.py

Change the arguments in main.py for different datasets and settings. The learned topics are saved in the runs folder.
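For example, a run on a specific dataset might look like the command below; the flag names are hypothetical placeholders, so check the argument definitions in main.py for the real ones.

python main.py --dataset 20ng --n_topic 100   # hypothetical flags; see main.py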

Evaluation

  • Clustering and Classification

We provide the K-means clustering and logistic-regression classification code in the cluster_clc.py file. These results are reported automatically during training.
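As a reference for what these evaluations compute, here is a minimal scikit-learn sketch, assuming `theta` holds the learned document-topic proportions and `labels` the ground-truth classes; the repository's own code in cluster_clc.py is authoritative.

```python
# A minimal sketch of the clustering / classification evaluation (assumed inputs).
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, normalized_mutual_info_score
from sklearn.model_selection import train_test_split

def evaluate(theta, labels, n_clusters):
    # K-means clustering on the document representations, scored by NMI.
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(theta)
    nmi = normalized_mutual_info_score(labels, km.labels_)

    # Logistic-regression classification accuracy on a held-out split.
    x_tr, x_te, y_tr, y_te = train_test_split(theta, labels, test_size=0.2)
    clf = LogisticRegression(max_iter=1000).fit(x_tr, y_tr)
    acc = accuracy_score(y_te, clf.predict(x_te))
    return nmi, acc
```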

  • Topic quality

The topic diversity metric is provided in Trainer.py. For topic coherence, please refer to Palmetto, which is not included in this repo and needs to be downloaded and set up separately.
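Topic diversity is commonly defined as the fraction of unique words among the top-k words of all topics; the sketch below follows that common definition, while the repository's exact computation is in Trainer.py.

```python
# A minimal sketch of topic diversity under the common top-k definition.
import numpy as np

def topic_diversity(topic_word, k=25):
    """topic_word: (n_topics, vocab_size) matrix of topic-word weights."""
    top_words = np.argsort(-topic_word, axis=1)[:, :k]   # top-k word ids per topic
    n_unique = len(np.unique(top_words))
    return n_unique / (topic_word.shape[0] * k)
```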

Citation

If you find this repo useful for your project, please consider citing it with the following BibTeX entry:

@article{wang2022representing,
  title={Representing Mixtures of Word Embeddings with Mixtures of Topic Embeddings},
  author={Wang, Dongsheng and Guo, Dandan and Zhao, He and Zheng, Huangjie and Tanwisuth, Korawat and Chen, Bo and Zhou, Mingyuan},
  journal={arXiv preprint arXiv:2203.01570},
  year={2022}
}
