
BP-LDA

Backpropagation Latent Dirichlet Allocation: a third-party reimplementation of the paper "End-to-end Learning of LDA by Mirror-Descent Back Propagation over a Deep Architecture" by Jianshu Chen et al.

The paper was accepted at NIPS 2015. Link to the paper: http://papers.nips.cc/paper/5967-end-to-end-learning-of-lda-by-mirror-descent-back-propagation-over-a-deep-architecture.pdf

To run the code, there are two executables (one depends on the other); copy both into the same working directory:

bp-lda/BP_sLDA/bin/Release/BP_sLDA.exe # for supervised task

bp-lda/BP_sLDA/bin/Release/BP_LDA.exe # for unsupervised task

For Windows users, simply open a command prompt window and run "BP_sLDA.exe" or "BP_LDA.exe".

For Linux/Mac users, you need to install Mono (http://www.mono-project.com/) first. Then open a terminal and run "mono BP_sLDA.exe" or "mono BP_LDA.exe".


Regression demo using Amazon Movie Review data set, 1% data, vocabulary size 5000

Raw data can be downloaded from https://snap.stanford.edu/data/web-Movies.html

Data: data_AmazonMovieReview_1percent/

.label: a 1-5 star rating, shifted to zero mean

.feature: each line is a document in bag-of-words representation; a colon separates a word index from its count, and tabs separate different words.
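For concreteness, here is a minimal Python sketch of how such a .feature/.label pair could be produced. The documents, ratings, and vocabulary below are tiny placeholders (in practice the vocabulary size must match --nInput), and this script is an illustration, not part of this repository:

from collections import Counter

docs = [["great", "movie", "great"], ["bad", "plot"]]  # tokenized documents (placeholders)
ratings = [5.0, 1.0]                                   # raw 1-5 star ratings (placeholders)
vocab = {"great": 0, "movie": 1, "bad": 2, "plot": 3}  # word -> index; size must equal --nInput

mean = sum(ratings) / len(ratings)
with open("train.feature", "w") as feat, open("train.label", "w") as lab:
    for doc, rating in zip(docs, ratings):
        counts = Counter(vocab[w] for w in doc if w in vocab)
        # one document per line, tab-separated index:count pairs
        feat.write("\t".join(f"{i}:{c}" for i, c in sorted(counts.items())) + "\n")
        lab.write(f"{rating - mean}\n")                # labels shifted to zero mean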

Command (for simplicity, the paths to the .exe and data files are omitted; make sure you specify the correct paths to your executables and data files when you experiment)

Supervised (alpha = 1.001):

BP_sLDA.exe --nHid 5 --nHidLayer 10 --nInput 5000 --nOutput 1 --OutputType linearQuad --alpha 1.001 --nEpoch 50 --BatchSize 1000 --mu_Phi 0.01 --nSamplesPerDisplay 10000 --TrainLabelFile train.label --TestLabelFile test.label --TrainInputFile train.feature --TestInputFile test.feature --ResultFile result_Voc5000 --ThreadNum 32 --MaxThreadDeg 32

Supervised (alpha = 0.1):

BP_sLDA.exe --nHid 5 --nHidLayer 10 --nInput 5000 --nOutput 1 --OutputType linearQuad --alpha 0.1 --nEpoch 50 --BatchSize 1000 --mu_Phi 0.0001 --nSamplesPerDisplay 10000 --TrainLabelFile train.label --TestLabelFile test.label --TrainInputFile train.feature --TestInputFile test.feature --ResultFile result_Voc5000 --ThreadNum 32 --MaxThreadDeg 32

Output files (if you follow the above example)

result_Voc5000.model.Phi and result_Voc5000.model.U: model files containing the parameters Phi and U, as described in the paper

result_Voc5000.perf and result_Voc5000.testscore: performance file and test score file


Classification demo using Multidomain Sentiment Classification data set, vocabulary size 1000

Raw data can be downloaded from https://www.cs.jhu.edu/~mdredze/datasets/sentiment/

Data: data_MultidomainSentiment/

.label: 0/1 binary class labels

.feature: same format as the regression demo; each line is a document in bag-of-words representation, with colons separating word indices from counts and tabs separating different words.

Command

Supervised (alpha = 1.001):

BP_sLDA.exe --nHid 5 --nHidLayer 10 --nInput 1000 --nOutput 2 --OutputType softmaxCE --alpha 1.001 --nEpoch 20 --BatchSize 100 --mu_Phi 0.01 --nSamplesPerDisplay 10000 --TrainLabelFile train.label --TestLabelFile test.label --TrainInputFile train.feature --TestInputFile test.feature --ResultFile result_Voc1000 --ThreadNum 32 --MaxThreadDeg 32

Supervised (alpha = 0.1):

BP_sLDA.exe --nHid 5 --nHidLayer 10 --nInput 1000 --nOutput 2 --OutputType softmaxCE --alpha 0.1 --nEpoch 20 --BatchSize 100 --mu_Phi 0.001 --nSamplesPerDisplay 10000 --TrainLabelFile train.label --TestLabelFile test.label --TrainInputFile train.feature --TestInputFile test.feature --ResultFile result_Voc1000 --ThreadNum 32 --MaxThreadDeg 32

Output files

Same as the regression example.


Unsupervised demo using Amazon Movie Review data set, 1% data, vocabulary size 5000

Data: data_AmazonMovieReview_1percent/

Command

Unsupervised (alpha = 1.001):

BP_LDA.exe --nHid 5 --nHidLayer 10 --nInput 5000 --alpha 1.001 --nEpoch 20 --BatchSize 1000 --flag_DumpFeature true --mu_Phi 0.01 --nSamplesPerDisplay 10000 --TrainInputFile train.feature --TestInputFile test.feature --ResultFile result_Voc5000 --ThreadNum 32 --MaxThreadDeg 32

Unsupervised (alpha = 0.1):

BP_LDA.exe --nHid 5 --nHidLayer 10 --nInput 5000 --alpha 0.1 --nEpoch 20 --BatchSize 1000 --flag_DumpFeature true --mu_Phi 0.0001 --nSamplesPerDisplay 10000 --TrainInputFile train.feature --TestInputFile test.feature --ResultFile result_Voc5000 --ThreadNum 32 --MaxThreadDeg 32

Output files:

result_Voc5000.train.fea and result_Voc5000.test.fea: inferred topic-distribution vectors for the train/test documents.
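If you want to post-process the dumped features, here is a minimal Python sketch. It assumes each line of a .fea file stores one document's topic-distribution vector in the same tab-separated index:value format as the input files; this is an assumption, so inspect your dump to confirm before relying on it:

import numpy as np

def load_fea(path, n_topics=5):  # n_topics should match --nHid
    # Parse one (assumed) index:value line per document into a dense vector.
    rows = []
    with open(path) as f:
        for line in f:
            vec = np.zeros(n_topics)
            for tok in line.split():
                idx, val = tok.split(":")
                vec[int(idx)] = float(val)
            rows.append(vec)
    return np.vstack(rows)

theta = load_fea("result_Voc5000.train.fea")  # shape: (num_train_docs, 5)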


Here is a brief explanation of the command-line arguments:

--nHid: Number of topics

--nHidLayer: Number of layers

--nInput: Vocabulary size

--nOutput: Number of output classes

--OutputType: "softmaxCE" means classification with a softmax output and cross-entropy loss; "linearQuad" means a linear output with quadratic (L2) loss

--alpha: Dirichlet parameter of the topics

--beta: Dirichlet parameter of the topic-word probability

--nEpoch: Number of training epochs

--BatchSize: Minibatch size (number of documents at each mini-batch)

--BatchSizeSchedule: A schedule for the training batch size, e.g. 1:10,2:100,11:1000 means batch size 10 in epoch 1, switching to 100 in epoch 2, and to 1000 in epoch 11 (see the sketch after this list)

--BatchSize_Test: Minibatch size for testing

--flag_DumpFeature: If true, save the inferred topic distribution of each document; if false, do not

--mu_Phi: learning rate for the topic-word probability matrix

--mu_U: learning rate for the topic to output matrix

--LearnRateSchedule: "Constant" for a constant learning rate

--nSamplesPerDisplay: Display training progress after every given number of documents, e.g. 10000

--nEpochPerSave: How often (in epochs) the model is saved

--nEpochPerTest: How often (in epochs) the model is evaluated on the test set

--nEpochPerDump: How often (in epochs) the topic distribution of each document is dumped (if --flag_DumpFeature is true)

--TrainLabelFile: Label file for the training set (one column; each row is the index of the class, starting from 0; e.g. 0 means the 0-th class and 3 means the 3rd class)

--TestLabelFile: Label file for test set

--ValidLabelFile: Label file for validation set

--TrainInputFile: Input file for the training set (tab-separated; each row is the bag-of-words vector of a document; e.g. 0:3 \t 5:7 means the 0-th word occurs three times and the 5th word occurs seven times in this document)

--TestInputFile: Input file for the test set

--ValidInputFile: Input file for the validation set

--ResultFile: Name of the result file (the model file, log file, etc., share this name with different extensions)

--ThreadNum: Number of threads

--MaxThreadDeg: Maximum degree of parallelism for multithreading

--DebugLevel: low/medium/high

--flag_RunningAvg: If true, apply running averaging to smooth training (yields a more stable model)

--flag_SaveAllModels: If true, save the model at each epoch separately; if false, save only the model from the last epoch
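To make the --BatchSizeSchedule semantics above concrete, here is a minimal Python sketch of the described behavior; it illustrates the epoch:batch-size rule and is not the repository's actual parser:

def parse_schedule(spec):
    # "1:10,2:100,11:1000" -> {1: 10, 2: 100, 11: 1000}
    return {int(e): int(b) for e, b in (pair.split(":") for pair in spec.split(","))}

def batch_size_for(epoch, schedule):
    # The batch size in effect is the one whose starting epoch is the
    # largest starting epoch not exceeding the current epoch.
    return schedule[max(e for e in schedule if e <= epoch)]

sched = parse_schedule("1:10,2:100,11:1000")
assert batch_size_for(1, sched) == 10
assert batch_size_for(5, sched) == 100    # still 100 until epoch 11
assert batch_size_for(12, sched) == 1000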


References

  1. Chen, Jianshu, et al. "End-to-end Learning of LDA by Mirror-Descent Back Propagation over a Deep Architecture." Advances in Neural Information Processing Systems (NIPS), 2015.

  2. McAuley, Julian, and Jure Leskovec. "From Amateurs to Connoisseurs: Modeling the Evolution of User Expertise through Online Reviews." WWW, 2013.

  3. Blitzer, John, Mark Dredze, and Fernando Pereira. "Biographies, Bollywood, Boom-boxes and Blenders: Domain Adaptation for Sentiment Classification." Association for Computational Linguistics (ACL), 2007.
