What is BigARTM?

The state-of-the-art platform for topic modeling.

What is BigARTM?

BigARTM is a powerful tool for topic modeling based on a novel technique called Additive Regularization of Topic Models. This technique effectively builds multi-objective models by adding the weighted sums of regularizers to the optimization criterion. BigARTM is known to combine well very different objectives, including sparsing, smoothing, topics decorrelation and many others. Such combination of regularizers significantly improves several quality measures at once almost without any loss of the perplexity.

References

Vorontsov K., Frei O., Apishev M., Romov P., Dudarenko M. BigARTM: Open Source Library for Regularized Multimodal Topic Modeling of Large Collections // Analysis of Images, Social Networks and Texts. 2015.
Vorontsov K., Frei O., Apishev M., Romov P., Dudarenko M., Yanina A. Non-Bayesian Additive Regularization for Multimodal Topic Modeling of Large Collections // Proceedings of the 2015 Workshop on Topic Models: Post-Processing and Applications, October 19, 2015 - pp. 29-37.
Vorontsov K., Potapenko A., Plavin A. Additive Regularization of Topic Models for Topic Selection and Sparse Factorization. // Statistical Learning and Data Sciences. 2015 — pp. 193-202.
Vorontsov K. V., Potapenko A. A. Additive Regularization of Topic Models // Machine Learning Journal, Special Issue “Data Analysis and Intelligent Optimization”, Springer, 2014.
More publications can be found in our wiki page.

Related Software Packages

David Blei's List of Open Source topic modeling software
MALLET: Java-based toolkit for language processing with topic modeling package
Gensim: Python topic modeling library
Vowpal Wabbit has an implementation of Online-LDA algorithm

How to Use

Installing

Download binary release or build from source using cmake:

$ mkdir build && cd build
$ cmake ..
$ make install

Command-line interface

Check out documentation for bigartm.

Examples:

Basic model (20 topics, outputed to CSV-file, inferred in 10 passes)

bigartm.exe -d docword.kos.txt -v vocab.kos.txt --write-model-readable model.txt
--passes 10 --batch-size 50 --topics 20

Basic model with less tokens (filtered extreme values based on token's frequency)

bigartm.exe -d docword.kos.txt -v vocab.kos.txt --dictionary-max-df 50% --dictionary-min-df 2
--passes 10 --batch-size 50 --topics 20 --write-model-readable model.txt

Simple regularized model (increase sparsity up to 60-70%)

bigartm.exe -d docword.kos.txt -v vocab.kos.txt --dictionary-max-df 50% --dictionary-min-df 2
--passes 10 --batch-size 50 --topics 20  --write-model-readable model.txt 
--regularizer "0.05 SparsePhi" "0.05 SparseTheta"

More advanced regularize model, with 10 sparse objective topics, and 2 smooth background topics

bigartm.exe -d docword.kos.txt -v vocab.kos.txt --dictionary-max-df 50% --dictionary-min-df 2
--passes 10 --batch-size 50 --topics obj:10;background:2 --write-model-readable model.txt
--regularizer "0.05 SparsePhi #obj"
--regularizer "0.05 SparseTheta #obj"
--regularizer "0.25 SmoothPhi #background"
--regularizer "0.25 SmoothTheta #background"

Interactive Python interface

BigARTM supports full-featured and clear Python API (see Installation to configure Python API for your OS).

Example:

import artm

# Prepare data
# Case 1: data in CountVectorizer format
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.datasets import fetch_20newsgroups
from numpy import array

cv = CountVectorizer(max_features=1000, stop_words='english')
n_wd = array(cv.fit_transform(fetch_20newsgroups().data).todense()).T
vocabulary = cv.get_feature_names()

bv = artm.BatchVectorizer(data_format='bow_n_wd',
                          n_wd=n_wd,
                          vocabulary=vocabulary)

# Case 2: data in UCI format (https://archive.ics.uci.edu/ml/datasets/Bag+of+Words)
bv = artm.BatchVectorizer(data_format='bow_uci',
                          collection_name='kos',
                          target_folder='kos_batches')

# Learn simple LDA model (or you can use advanced artm.ARTM)
model = artm.LDA(num_topics=15, dictionary=bv.dictionary)
model.fit_offline(bv, num_collection_passes=20)

# Print results
model.get_top_tokens()

Refer to tutorials for details on how to start using BigARTM from Python, user's guide can provide information about more advanced features and cases.

Low-level API

Contributing

Refer to the Developer's Guide and follows Code Style.

To report a bug use issue tracker. To ask a question use our mailing list. Feel free to make pull request.

License

BigARTM is released under New BSD License that allowes unlimited redistribution for any purpose (even for commercial use) as long as its copyright notices and the license’s disclaimers of warranty are maintained.

Name		Name	Last commit message	Last commit date
Latest commit History 1,557 Commits
3rdparty		3rdparty
cmake_modules		cmake_modules
csharp		csharp
docs		docs
python		python
src		src
test_data		test_data
utils		utils
.gitignore		.gitignore
.travis.yml		.travis.yml
CMakeLists.txt		CMakeLists.txt
LICENSE.txt		LICENSE.txt
README.md		README.md
appveyor-mingw.yml		appveyor-mingw.yml
appveyor.yml		appveyor.yml
codestyle_checks.sh		codestyle_checks.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

What is BigARTM?

References

Related Software Packages

How to Use

Installing

Command-line interface

Interactive Python interface

Low-level API

Contributing

License

About

Releases

Packages

Languages

License

Seriont/bigartm

Folders and files

Latest commit

History

Repository files navigation

What is BigARTM?

References

Related Software Packages

How to Use

Installing

Command-line interface

Interactive Python interface

Low-level API

Contributing

License

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages