<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>DARIAH-DE :: Demonstrator</title>
<meta name="viewport" content="width=device-width, initial-scale=1">
<meta name="author" content="DARIAH-DE">
<meta name="description" content="DARIAH-DE :: Demonstrator">
<!-- CSS Imports -->
<link rel="stylesheet" href="{{url_for('static', filename='css/bootstrap.css')}}" type="text/css" media="screen, projection" />
<link rel="stylesheet" href="{{url_for('static', filename='css/bootstrap-responsive.css')}}" type="text/css" media="screen, projection" />
<link rel="stylesheet" href="{{url_for('static', filename='css/application.css')}}" type="text/css" media="screen, projection" />
<link rel="stylesheet" href="{{url_for('static', filename='css/bootstrap-customization.css')}}" type="text/css" media="screen, projection" />
<link rel="stylesheet" href="{{url_for('static', filename='css/bootstrap-modal.css')}}" type="text/css" media="screen, projection" />
<link rel="stylesheet" href="{{url_for('static', filename='css/font-awesome.css')}}">
<style>div#loading {
width: 120px;
height: 120px;
position: absolute;
margin: auto;
top: 0;
right: 0;
bottom: 0;
left: 0;
display: none;
margin: auto;
background: url(/static/pie.gif) no-repeat;
cursor: wait;
}
</style>
<!-- JavaScript imports -->
<script type="text/javascript" src="{{url_for('static', filename='js/jquery-1.8.2.js')}}"></script>
<script type="text/javascript" src="{{url_for('static', filename='js/bootstrap.js')}}"></script>
<script type="text/javascript" src="{{url_for('static', filename='js/globalmenu.js')}}"></script>
<script type="text/javascript">
function loading(){
$("#loading").show();
$("#content").hide();
}
</script>
<!-- HTML5 shim, for IE6-8 support of HTML5 elements -->
<!--[if lt IE 9]>
<script src="http://html5shim.googlecode.com/svn/trunk/html5.js"></script>
<![endif]-->
<link rel="shortcut icon" type="image/png" href="{{url_for('static', filename='img/page_icon.png')}}" />
</head>
<body>
<div id="loading"></div>
<div id="content">
<div class="navbar navbar-inverse navbar-static-top navbar-dariah" id="top">
<div class="navbar-inner">
<div class="container-fluid">
<div class="row-fluid">
<div class="span1"></div>
<div class="span10">
<a class="btn btn-navbar" data-toggle="collapse" data-target=".nav-collapse">
<span class="icon-bar"></span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
</a>
<div class="nav-collapse collapse">
<ul class="nav pull-right">
</ul>
<ul class="nav">
<!--
Don't change this section!
-->
<li id="home_button" class="dropdown">
<a class="brand dropdown-toggle" data-toggle="dropdown" href="#">
<span class="caret"></span> DARIAH-DE
</a>
<ul id="home_dropdown_menu" class="dropdown-menu">
<li class="dropdown-submenu">
<a tabindex="-1" href="#">DARIAH-DE</a>
<ul class="dropdown-menu">
<li><a href="http://de.dariah.eu">DARIAH-DE Home</a>
</li>
<li class="divider"></li>
<li><a href="http://textgrid.de/">TextGrid</a>
</li>
</ul>
</li>
<li class="divider"></li>
<li class="dropdown-submenu">
<a tabindex="-1" href="#">DARIAH-EU</a>
<ul class="dropdown-menu">
<li><a href="http://www.dariah.eu/">DARIAH-EU Home</a>
</li>
</ul>
</li>
</ul>
</li>
</ul>
</div>
</div>
</div>
</div>
</div>
</div>
<div id="content_layout" class="container-fluid">
<div style="height: 70px;"></div>
<div class="row-fluid">
<div class="span10 offset1 main-content-wrapper no-margin">
<div id="content" class="primary-area">
<h1>Demonstrator: Topic Modeling</h1>
<div id="contentInner" style="text-align:justify">
<form action="/upload" method="POST" enctype="multipart/form-data">
<p>The text mining technique <b>Topic Modeling</b> has become a popular statistical method for clustering documents. This web application introduces a user-friendly workflow that covers data preprocessing, an implementation of the prototypical topic model <b>Latent Dirichlet Allocation</b> (LDA), which learns the relationships between words, topics, and documents, as well as an evaluation measure and visualization options to explore the trained LDA model.</p>
<h2>1. Preprocessing</h2>
<h3>1.1 Reading a Corpus of Documents</h3>
<p>Upload some plain text or TEI-encoded XML files.</p>
<input type="file" name="files" multiple>
<br>
<br>
<h3>1.2 Tokenize Corpus</h3>
<p>Tokenization is the task of cutting a stream of characters into linguistic units, simply words or, more precisely, <i>tokens</i>. Without identifying tokens, it is difficult to extract important information, such as the most frequent words, also known as <i>stopwords</i>, or words that occur only once in a document, called <i>hapax legomena</i>.</p>
<h3>1.3 Feature Selection and/or Removal</h3>
<p>Stopwords and hapax legomena are harmful for the LDA model and have to be removed from the corpus. If you want to determine stopwords individually based on your text files, define a threshold in the following line: the given number of most frequent words will be removed.</p>
<div class="alert alert-info">
<button type="button" class="close" data-dismiss="alert">×</button>
<b>Tip:</b> If your corpus is large enough (at least about 10 documents, e.g. short stories), try removing the top 100 most frequent words. If you are unsure about the threshold, it is safer to use an external stopword list, e.g. <a href="https://raw.githubusercontent.com/DARIAH-DE/Topics/testing/tutorial_supplementals/stopwords/en.txt">this one</a> for English text.
</div>
<input type="text" name="mfws" value="100">
<p>Alternatively, upload your own words-to-remove list here:</p>
<input type="file" name="stoplist">
<br>
<h2>2. Model Creation</h2>
<p>This workflow uses the LDA implementation of the open-source toolkit <a href="https://radimrehurek.com/gensim/">Gensim</a>. Since 2008, Gensim has been used and cited in over 400 commercial and academic applications.</p>
<h3>2.1 Generate LDA Model</h3>
<div class="alert alert-block">
<button type="button" class="close" data-dismiss="alert">×</button>
<i class="fa fa-exclamation-circle"></i> This step can take quite a while: anywhere from a few seconds to several hours, depending on corpus size and the number of passes. Our example short stories corpus should be done within a minute or two.</div>
<p>Set the number of topics in the following line.</p>
<input type="text" name="number_topics" value="10">
<p>Now define the number of passes. More passes generally give a better model, but training takes longer.</p>
<input type="text" name="passes" value="10">
<h2>3. Model Visualization</h2>
<h3>3.1 The Document-Topic-Matrix in a Heatmap</h3>
<p>This visualization displays the kind of information that is probably most useful to literary scholars. Going beyond pure exploration, it can be used to show thematic developments over a set of texts as well as within a single text, akin to a dynamic topic model.</p>
<h2>4. Submitting</h2>
<p>Finally, submit your data and explore the model.</p>
<input type="submit" value="Send" onclick="loading();">
</form>
<hr>
<h2>Contact</h2>
<p><a href="mailto:pielstroem@biozentrum.uni-wuerzburg.de">Dr. Steffen Pielström</a>, University of Würzburg</p>
</div>
</div>
</div>
</div>
<div class="row-fluid">
<div id="footer" class="span10 offset1 no-margin footer">
<span>© 2017 DARIAH-DE</span>
<ul class="pull-right inline">
<li><a href="https://de.dariah.eu/impressum">Impressum</a>
</li>
<li><a href="https://wiki.de.dariah.eu/display/publicde/Cluster+5%3A+Quantitative+Datenanalyse">Contact</a>
</li>
</ul>
</div>
</div>
</div>
<noscript>
<div>Enable JavaScript!</div>
</noscript>
</div>
</body>
</html>