<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>DARIAH-DE :: Topics – Easy Topic Modeling</title>
<meta name="viewport" content="width=device-width, initial-scale=1">
<meta name="author" content="DARIAH-DE">
<meta name="description" content="DARIAH-DE :: Demonstrator">
<!-- CSS Imports -->
<link rel="stylesheet" href="{{url_for('static', filename='css/bootstrap.css')}}" type="text/css" media="screen, projection" />
<link rel="stylesheet" href="{{url_for('static', filename='css/bootstrap-responsive.css')}}" type="text/css" media="screen, projection" />
<link rel="stylesheet" href="{{url_for('static', filename='css/application.css')}}" type="text/css" media="screen, projection" />
<link rel="stylesheet" href="{{url_for('static', filename='css/bootstrap-customization.css')}}" type="text/css" media="screen, projection" />
<link rel="stylesheet" href="{{url_for('static', filename='css/bootstrap-modal.css')}}" type="text/css" media="screen, projection" />
<link rel="stylesheet" href="{{url_for('static', filename='css/font-awesome.css')}}">
<style>
div#loading {
width: 120px;
height: 120px;
position: absolute;
margin: auto;
top: 0;
right: 0;
bottom: 0;
left: 0;
display: none;
background: url("{{url_for('static', filename='pie.gif')}}") no-repeat;
cursor: wait;
}
</style>
<!-- JavaScript imports -->
<script type="text/javascript" src="{{url_for('static', filename='js/jquery-1.8.2.js')}}"></script>
<script type="text/javascript" src="{{url_for('static', filename='js/bootstrap.js')}}"></script>
<script type="text/javascript" src="{{url_for('static', filename='js/globalmenu.js')}}"></script>
<script type="text/javascript">
function loading() {
$("#loading").show();
$("#content").hide();
}
</script>
<!-- HTML5 shim, for IE6-8 support of HTML5 elements -->
<!--[if lt IE 9]>
<script src="http://html5shim.googlecode.com/svn/trunk/html5.js"></script>
<![endif]-->
<link rel="shortcut icon" type="image/png" href="{{url_for('static', filename='img/page_icon.png')}}" />
</head>
<body>
<div id="loading"></div>
<div id="content">
<div class="navbar navbar-inverse navbar-static-top navbar-dariah" id="top">
<div class="navbar-inner">
<div class="container-fluid">
<div class="row-fluid">
<div class="span1"></div>
<div class="span10">
<a class="btn btn-navbar" data-toggle="collapse" data-target=".nav-collapse">
<span class="icon-bar"></span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
</a>
<div class="nav-collapse collapse">
<ul class="nav pull-right">
</ul>
<ul class="nav">
<!--
Don't change this section!
-->
<li id="home_button" class="dropdown">
<a class="brand dropdown-toggle" data-toggle="dropdown" href="#">
<span class="caret"></span> DARIAH-DE
</a>
<ul id="home_dropdown_menu" class="dropdown-menu">
<li class="dropdown-submenu">
<a tabindex="-1" href="#">DARIAH-DE</a>
<ul class="dropdown-menu">
<li><a href="http://de.dariah.eu">DARIAH-DE Home</a>
</li>
<li class="divider"></li>
<li><a href="http://textgrid.de/ ">TextGrid</a>
</li>
</ul>
</li>
<li class="divider"></li>
<li class="dropdown-submenu">
<a tabindex="-1" href="#">DARIAH-EU</a>
<ul class="dropdown-menu">
<li><a href="http://www.dariah.eu/">DARIAH-EU Home</a>
</li>
</ul>
</li>
</ul>
</li>
</ul>
</div>
</div>
</div>
</div>
</div>
</div>
<div id="content_layout" class="container-fluid">
<div style="height: 70px;"></div>
<div class="row-fluid">
<div class="span10 offset1 main-content-wrapper no-margin">
<div id="content" class="primary-area">
<h1>Topics – Easy Topic Modeling</h1>
<div id="contentInner" style="text-align:justify">
<form action="/upload" method="POST" enctype="multipart/form-data">
<p>The text mining technique <b>Topic Modeling</b> has become a popular statistical method for clustering documents. This web application introduces a user-friendly workflow consisting of three steps: data preprocessing, the actual topic modeling using <b>latent Dirichlet allocation</b> (LDA), which learns the relationships between words, topics and documents, and an interactive visualization for exploring the model.</p>
<p>LDA, introduced in the context of text analysis in <a href="http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf">2003</a>, is an instance of a more general class of models called <b>mixed-membership models</b>. Because the model involves a number of
distributions and parameters, inference is typically performed using <a href="https://en.wikipedia.org/wiki/Gibbs_sampling">Gibbs sampling</a> with conjugate priors and is based purely on word frequencies. Numerous introductions to topic modeling have been written for humanists (e.g. <a href="http://www.scottbot.net/HIAL/index.html@p=19113.html">this one</a>), which provide more detail on its technical and epistemic properties.</p>
<p>For this workflow, you will need a corpus (a set of texts) as plain text (<b>.txt</b>) or <a href="http://www.tei-c.org/index.xml">TEI XML</a> (<b>.xml</b>). The <a href="https://textgridrep.org/">TextGrid Repository</a> is a great place to start searching for text data. To demonstrate topic modeling, we also provide a small text collection containing 15 diary excerpts and 15 war diary excerpts, which appeared in <i>Die Grenzboten</i>, a German newspaper of the late 19th and early 20th century.</p>
<div class="alert alert-block">
<button type="button" class="close" data-dismiss="alert">×</button>
<i class="fa fa-exclamation-circle"></i> Of course, you can work with your own corpus, but keep in mind that this application aims for simplicity and usability. If you have a large corpus (say, more than 200 documents with more than 5000 words per document), you may wish to use more sophisticated topic models such as those implemented in <a href="http://mallet.cs.umass.edu/topics.php">MALLET</a>, which is known to be more robust than standard LDA. Have a look at our Jupyter notebook introducing <a href="https://github.com/DARIAH-DE/Topics/blob/master/IntroducingMallet.ipynb">topic modeling with MALLET</a>.</div>
<br>
<h2>1. Preprocessing</h2>
<h3>1.1. Reading a corpus of documents</h3>
<p>Select plain text (<b>.txt</b>) or <a href="http://www.tei-c.org/index.xml">TEI XML</a> files (<b>.xml</b>).</p>
<input type="file" name="files" multiple><br><br>
<h3>1.2. Tokenize corpus</h3>
<p>Your text files will be tokenized. Tokenization is the task of cutting a stream of characters into linguistic units, simply words or, more precisely, <i>tokens</i>. Without identifying tokens, it is difficult to extract important information,
such as the most frequent words, which are often treated as <i>stopwords</i>, or words that occur only once in a document or corpus, called <i>hapax legomena</i>.</p>
<h3>1.3. Feature selection and removal</h3>
<p>Stopwords and hapax legomena add noise to LDA and should be removed from the corpus. To determine stopwords from your own corpus, set a threshold for the number of most frequent words in the field below.</p>
<div class="alert alert-info">
<button type="button" class="close" data-dismiss="alert">×</button>
<b>Tip:</b> Be careful when removing the most frequent words: you might remove words that are important for LDA. For better results, we highly recommend using an external stopword list, e.g. <a href="https://raw.githubusercontent.com/DARIAH-DE/Topics/master/tutorial_supplementals/stopwords/en.txt">this one</a> for English corpora.
</div>
<input type="text" name="mfw_threshold" value="150">
<p>Alternatively, upload your own list of words to remove here:</p>
<input type="file" name="stopword_list"><br><br>
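<p>The preprocessing steps described above can be sketched in a few lines of Python. This is only an illustrative sketch using the standard library, not the code this application runs: the function names <code>tokenize</code>, <code>find_features_to_remove</code> and <code>remove_features</code> are hypothetical, tokens are assumed to be lowercased runs of letters, and hapax legomena are assumed to be counted over the whole corpus.</p>
<pre>
```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase a text and split it into tokens (runs of Unicode letters)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def find_features_to_remove(tokenized_docs, mfw_threshold):
    """Return the mfw_threshold most frequent words and all hapax legomena.

    Assumption: frequencies are counted over the whole corpus.
    """
    counts = Counter(token for doc in tokenized_docs for token in doc)
    most_frequent = {word for word, _ in counts.most_common(mfw_threshold)}
    hapax_legomena = {word for word, n in counts.items() if n == 1}
    return most_frequent, hapax_legomena


def remove_features(tokenized_docs, features):
    """Drop every token that is contained in the set of features."""
    return [[t for t in doc if t not in features] for doc in tokenized_docs]


docs = ["The cat sat on the mat.", "The dog sat on the log."]
tokenized = [tokenize(d) for d in docs]
mfw, hapax = find_features_to_remove(tokenized, mfw_threshold=1)
cleaned = remove_features(tokenized, mfw | hapax)
print(cleaned)  # [['sat', 'on'], ['sat', 'on']]
```
</pre>
<p>In this toy corpus, the single most frequent word (<i>the</i>) and the four words occurring only once are removed. With a real corpus, a threshold such as the default of 150 is more appropriate.</p>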
<h2>2. Model creation</h2>
<p>The actual topic modeling is done with an external state-of-the-art LDA implementation. This workflow relies on the Python library <b><a href="http://pythonhosted.org/lda/index.html">lda</a></b> by <a href="https://ariddell.org">Allen Riddell</a>,
which is very lightweight and provides basic LDA.</p>
<h3>2.1. Generate LDA model</h3>
<div class="alert alert-block">
<button type="button" class="close" data-dismiss="alert">×</button>
<i class="fa fa-exclamation-circle"></i> This step can take anywhere from a few seconds to several hours, depending on corpus size and the number of iterations. Our short example corpus should be done within a
minute or two.</div>
<p>Set the number of topics to use in the following line. The best number depends on what you are looking for in the model – the default will provide a broad overview of the contents of the corpus.</p>
<input type="text" name="num_topics" value="10">
<p>Choose the number of iterations. The number of sampling iterations is a trade-off between the time taken to complete sampling and the quality of the topic model. The default value should produce quite good results, but feel free
to increase the number of iterations.</p>
<input type="text" name="num_iterations" value="5000"><br><br>
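<p>To illustrate what the number of topics and the number of iterations control, here is a minimal sketch of LDA with collapsed Gibbs sampling in plain Python. It is not the code of the <b>lda</b> library and is far too slow for real corpora. The function name <code>lda_gibbs</code>, the hyperparameters <code>alpha</code> and <code>eta</code> with their default values, and the <code>seed</code> are assumptions made for this example. Only <code>num_topics</code> and <code>num_iterations</code> correspond to the fields above.</p>
<pre>
```python
import random


def lda_gibbs(docs, num_topics, num_iterations, alpha=0.1, eta=0.01, seed=1):
    """Fit LDA with collapsed Gibbs sampling (illustrative sketch).

    docs is a list of token lists. Returns (doc_topic, topic_word, vocab);
    every row of doc_topic and topic_word is a probability distribution.
    """
    rng = random.Random(seed)
    vocab = sorted({token for doc in docs for token in doc})
    word_id = {word: i for i, word in enumerate(vocab)}
    vocab_size = len(vocab)

    doc_topic_counts = [[0] * num_topics for _ in docs]
    topic_word_counts = [[0] * vocab_size for _ in range(num_topics)]
    topic_totals = [0] * num_topics

    # Assign every token to a random topic to initialise the counts.
    assignments = []
    for d, doc in enumerate(docs):
        doc_assignments = []
        for token in doc:
            k = rng.randrange(num_topics)
            doc_topic_counts[d][k] += 1
            topic_word_counts[k][word_id[token]] += 1
            topic_totals[k] += 1
            doc_assignments.append(k)
        assignments.append(doc_assignments)

    # One iteration resamples the topic of every token in the corpus.
    for _ in range(num_iterations):
        for d, doc in enumerate(docs):
            for i, token in enumerate(doc):
                w = word_id[token]
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic_counts[d][k] -= 1
                topic_word_counts[k][w] -= 1
                topic_totals[k] -= 1
                # Unnormalised probability of each topic for this token.
                weights = [
                    (doc_topic_counts[d][j] + alpha)
                    * (topic_word_counts[j][w] + eta)
                    / (topic_totals[j] + vocab_size * eta)
                    for j in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic_counts[d][k] += 1
                topic_word_counts[k][w] += 1
                topic_totals[k] += 1

    # Turn the smoothed counts into probability distributions.
    doc_topic = [
        [(doc_topic_counts[d][k] + alpha) / (len(doc) + num_topics * alpha)
         for k in range(num_topics)]
        for d, doc in enumerate(docs)
    ]
    topic_word = [
        [(topic_word_counts[k][w] + eta) / (topic_totals[k] + vocab_size * eta)
         for w in range(vocab_size)]
        for k in range(num_topics)
    ]
    return doc_topic, topic_word, vocab


docs = [["war", "army", "front"], ["diary", "home", "family"], ["war", "front", "diary"]]
doc_topic, topic_word, vocab = lda_gibbs(docs, num_topics=2, num_iterations=50)
for row in doc_topic:
    print([round(p, 2) for p in row])
```
</pre>
<p>Each printed row is the topic distribution of one document. Because every iteration visits every token once, the running time grows with both corpus size and the number of iterations, which is the trade-off described above.</p>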
<h2>3. Model visualization</h2>
<p>When using topic models to explore text collections, we are typically interested in examining texts in terms of their constituent topics (instead of word frequencies). Because the number of topics is so much smaller than the number of
unique vocabulary elements (say, 10 versus 10,000), a range of data visualization methods becomes available. The visualization techniques provided in this workflow are not specific to topic models per se but rather fall into a more general
category of techniques for visualizing count data.</p><br>
<h2>4. Submitting Data</h2>
<p>Finally, submit your data and explore the model.</p>
<div class="alert alert-block">
<button type="button" class="close" data-dismiss="alert">×</button>
<i class="fa fa-exclamation-circle"></i> This application is still in development, so errors may occur. Feel free to send us an email or report problems on the <a href="https://github.com/DARIAH-DE/Topics/issues">GitHub issues</a> page.</div>
<input type="submit" value="Send" onclick="loading();">
</form>
<hr>
<h2>Contact</h2>
<p><a href="mailto:pielstroem@biozentrum.uni-wuerzburg.de">Dr. Steffen Pielström</a>, University of Würzburg</p>
</div>
</div>
</div>
</div>
<div class="row-fluid">
<div id="footer" class="span10 offset1 no-margin footer">
<span>© 2017 DARIAH-DE</span>
<ul class="pull-right inline">
<li><a href="https://de.dariah.eu/impressum">Impressum</a>
</li>
<li><a href="https://wiki.de.dariah.eu/display/publicde/Cluster+5%3A+Quantitative+Datenanalyse">Contact</a>
</li>
</ul>
</div>
</div>
</div>
<noscript>
<div>Enable JavaScript!</div>
</noscript>
</div>
</body>
</html>