SITS

This is an implementation of the Gibbs samplers for SITS models (both parametric and nonparametric versions) described in Nguyen et al. (ACL 2012). For more information about the model, please refer to the paper.

@inproceedings{Nguyen:Boyd-Graber:Resnik-2012,
	Author = {Viet-An Nguyen and Jordan Boyd-Graber and Philip Resnik},
	Booktitle = {Association for Computational Linguistics},
	Year = {2012},
	Location = {Jeju, South Korea},
	Title = {{SITS}: A Hierarchical Nonparametric Model using Speaker Identity for Topic Segmentation in Multiparty Conversations},
}

An extended version, with more details and experiments, was published in the journal Machine Learning:

@article{Nguyen:Boyd-Graber:Resnik:Cai:Midberry:Wang-2014,
	Publisher = {Springer},
	Title = {Modeling Topic Control to Detect Influence in Conversations using Nonparametric Topic Models},
	Journal = {Machine Learning},
	Author = {Viet-An Nguyen and Jordan Boyd-Graber and Philip Resnik and Deborah Cai and Jennifer Midberry and Yuanxin Wang},
	Year = {2014},
	Volume = {95},
	Number = {3},
	Pages = {381--421},
}

Compile

To compile: ant compile
To make a clean build: ant clean-build
To make the jar file: ant jar

Please refer to the file build.xml for additional options.

Input Data

SITS takes as input a set of conversations. Each conversation consists of multiple turns, and each turn is a maximal uninterrupted utterance by one speaker. Currently, SITS accepts the following files:

<dataset>.words: contains the main texts in the following format:

  <num-conversations>\n
  <total-num-turns>\n
  <num-words-conv-1-turn-1>\t<word-1> <word-2> ...\n
  <num-words-conv-1-turn-2>\t<word-1> <word-2> ...\n
  ...\n
  <num-words-conv-1-turn-T1>\t<word-1> <word-2> ...\n
  \n
  <num-words-conv-2-turn-1>\t<word-1> <word-2> ...\n
  <num-words-conv-2-turn-2>\t<word-1> <word-2> ...\n
  ...\n
  <num-words-conv-2-turn-T2>\t<word-1> <word-2> ...\n

In this template, \n and \t denote newline and tab characters. A blank line separates consecutive conversations. Each word is an index into the word vocabulary stored in <dataset>.voc.
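The layout above can be written and read back with a few lines of code. The following Python sketch is illustrative only and is not part of SITS: the function names format_words and parse_words are hypothetical, and it assumes that \n and \t in the template stand for literal newline and tab characters and that the last conversation is not followed by a blank line.

```python
from typing import List

# conversations -> turns -> word indices (indices into <dataset>.voc)
Conversations = List[List[List[int]]]


def format_words(conversations: Conversations) -> str:
    """Serialize conversations into the <dataset>.words layout (assumed details noted above)."""
    total_turns = sum(len(turns) for turns in conversations)
    blocks = []
    for turns in conversations:
        lines = []
        for words in turns:
            # Each turn: word count, a tab, then space-separated word indices.
            lines.append(f"{len(words)}\t{' '.join(str(w) for w in words)}")
        blocks.append("\n".join(lines))
    header = f"{len(conversations)}\n{total_turns}\n"
    # A blank line separates consecutive conversations.
    return header + "\n\n".join(blocks) + "\n"


def parse_words(text: str) -> Conversations:
    """Parse the contents of a <dataset>.words file and check its declared counts."""
    lines = text.split("\n")
    num_conversations = int(lines[0])
    num_turns = int(lines[1])
    conversations: Conversations = []
    current: List[List[int]] = []
    for line in lines[2:]:
        if not line.strip():
            # Blank line: close the conversation collected so far, if any.
            if current:
                conversations.append(current)
                current = []
            continue
        count, _, rest = line.partition("\t")
        words = [int(w) for w in rest.split()]
        if len(words) != int(count):
            raise ValueError(f"declared {count} words but found {len(words)}")
        current.append(words)
    if current:
        conversations.append(current)
    if len(conversations) != num_conversations:
        raise ValueError("conversation count does not match the header")
    if sum(len(turns) for turns in conversations) != num_turns:
        raise ValueError("turn count does not match the header")
    return conversations
```

Checking that parse_words(format_words(x)) returns x on a small example is a quick way to test a preprocessing script before running the sampler.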

<dataset>.show: contains the conversation name for each turn. The number of lines in this file is equal to the number of turns in <dataset>.words.
<dataset>.authors: contains the speaker of each turn. Each speaker is an index in the speaker vocabulary, stored in file <dataset>.whois.
<dataset>.voc: contains the word vocabulary
<dataset>.whois: contains the speaker vocabulary
<dataset>.text: contains the raw texts

An example of formatted data is included in the data folder (data/debate2008/ldaformat).
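Because the per-turn files must line up with <dataset>.words, a quick consistency check can catch formatting mistakes early. The sketch below is hypothetical and not part of SITS. It relies on the statement above that <dataset>.show has one line per turn, and it assumes (this is not stated explicitly) that <dataset>.authors also has exactly one line per turn.

```python
from typing import List


def check_turn_counts(words_text: str, show_text: str, authors_text: str) -> List[str]:
    """Return a list of problems; an empty list means all turn counts agree."""
    words_lines = words_text.split("\n")
    # Second header line of <dataset>.words: total number of turns.
    declared_turns = int(words_lines[1])
    # Every non-blank line after the two header lines is one turn.
    actual_turns = sum(1 for line in words_lines[2:] if line.strip())
    problems = []
    if actual_turns != declared_turns:
        problems.append(
            f".words declares {declared_turns} turns but contains {actual_turns}"
        )
    # Assumption: one line per turn in both per-turn files.
    for name, text in ((".show", show_text), (".authors", authors_text)):
        n = len(text.splitlines())
        if n != actual_turns:
            problems.append(f"{name} has {n} lines, expected {actual_turns}")
    return problems
```

In practice the three arguments would be the contents of the corresponding files in the formatted folder, read as text.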

Run models

Parametric SITS

java -cp 'dist/sits.jar:lib/*' segmentation.TopicSegmentation --dataset <dataset> --input <format_folder> --output <output_folder> --model param -v

Here are the arguments:

<dataset>: name of the dataset, which is also the base name of the files in the formatted folder (see above).
<format_folder>: path to the folder containing the formatted data.
<output_folder>: path to the folder in which to store the output.
burnIn: number of iterations during the burn-in period (default: 2500)
maxIter: maximum number of iterations (default: 5000)
sampleLag: lag between samples (default: 100)
K: number of topics (default: 25)
alpha: Dirichlet parameter for documents' topic distribution (default: 0.1)
beta: Dirichlet parameter for topics' word distribution (default: 0.1)
gamma: Beta parameter for speakers' topic shift distribution (default: 0.25)
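To make the interaction between burnIn, maxIter, and sampleLag concrete, the following Python sketch lists the iterations at which a sample would be kept under one common Gibbs-sampling convention: keep iteration i when i >= burnIn and i is a multiple of sampleLag, with iterations numbered from 0 to maxIter - 1. This convention is an assumption made for illustration; the exact bookkeeping in the SITS source may differ, and the function name retained_iterations is hypothetical.

```python
from typing import List


def retained_iterations(burn_in: int = 2500, max_iter: int = 5000,
                        sample_lag: int = 100) -> List[int]:
    """Iterations kept under the ASSUMED convention: i >= burn_in and i % sample_lag == 0."""
    return [
        i for i in range(max_iter)  # iterations 0 .. max_iter - 1
        if i >= burn_in and i % sample_lag == 0
    ]


kept = retained_iterations()
print(len(kept), kept[0], kept[-1])  # prints: 25 2500 4900
```

Under this assumed convention the defaults give 25 retained samples. A shorter burn-in or a smaller lag gives more samples, but consecutive samples are then more strongly correlated.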

Example

java -cp 'dist/sits.jar:lib/*' segmentation.TopicSegmentation --dataset debate2008 --input data/debate2008/ldaformat/ --output data/segmentation/debate2008/ --burnIn 100 --maxIter 5000 --sampleLag 50 --gamma 2.5 --model param -v --alpha 0.1 --beta 0.1
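When running several configurations, it can be convenient to assemble this command programmatically. The Python sketch below is a hypothetical helper and not part of SITS. It uses only the class name and the flags shown in the commands above, and it passes any extra options as --<name> <value>, which matches the example but is an assumption for options the example does not show (such as K).

```python
from typing import Dict, List, Optional, Union


def build_sits_command(dataset: str, input_dir: str, output_dir: str,
                       model: str = "param",
                       options: Optional[Dict[str, Union[int, float]]] = None,
                       verbose: bool = True) -> List[str]:
    """Build the argument list for the SITS command line shown in this README."""
    if model not in ("param", "non-param"):
        raise ValueError(f"unknown model: {model}")
    cmd = [
        "java", "-cp", "dist/sits.jar:lib/*",  # ':' is the Unix class path separator
        "segmentation.TopicSegmentation",
        "--dataset", dataset,
        "--input", input_dir,
        "--output", output_dir,
        "--model", model,
    ]
    # Assumption: every optional argument is passed as --<name> <value>.
    for name, value in (options or {}).items():
        cmd += [f"--{name}", str(value)]
    if verbose:
        cmd.append("-v")
    return cmd


cmd = build_sits_command(
    "debate2008", "data/debate2008/ldaformat/", "data/segmentation/debate2008/",
    options={"burnIn": 100, "maxIter": 5000, "sampleLag": 50, "gamma": 2.5},
)
print(" ".join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) from the repository root once the jar has been built. No shell quoting is needed in that case, because the class path wildcard lib/* is expanded by the java launcher itself rather than by the shell.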

Nonparametric SITS

java -cp 'dist/sits.jar:lib/*' segmentation.TopicSegmentation --dataset <dataset> --input <format_folder> --output <output_folder> --model non-param -v

Here are the arguments:

<dataset>: name of the dataset, which is also the base name of the files in the formatted folder (see above).
<format_folder>: path to the folder containing the formatted data.
<output_folder>: path to the folder in which to store the output.
burnIn: number of iterations during the burn-in period (default: 2500)
maxIter: maximum number of iterations (default: 5000)
sampleLag: lag between samples (default: 100)
K: initial number of topics (default: 25)
alpha, alpha_0, alpha_C: Dirichlet process parameters for documents' topic distribution (default: 0.1)
beta: Dirichlet parameter for topics' word distribution (default: 0.1)
gamma: Beta parameter for speakers' topic shift distribution (default: 0.25)
