{% extends "layout.html" %}
{% block navigation %}
<ul class="nav pull-right">
<li>
<a href="{{ url_for('index') }}"> <i class="icon-refresh icon-white"></i> Reset</a>
</li>
<li>
<a href="{{ url_for('help') }}"> <i class="icon-question-sign icon-white"></i> Help</a>
</li>
<li>
<a href="{{ url_for('download') }}"> <i class="icon-download icon-white"></i> Save Graphics and Tables</a>
</li>
</ul>
{% endblock %}
{% block bokeh_scripts %}
{{ corpus_boxplot_script|safe }}
{{ heatmap_script|safe }}
{{ topics_script|safe }}
{{ documents_script|safe }}
{% endblock %}
{% block content %}
<h1>Topics – Easy Topic Modeling</h1>
<div id="contentInner" style="text-align:justify;">
<h2>1. Corpus and Parameter Summary</h2>
<p>The following table summarizes all parameter settings, among other information. This can be useful if you want to create more than one topic model and compare the results. A common way to evaluate a probabilistic model is to calculate its <b>log-likelihood</b>. As you increase the number of iterations, your topics will usually improve and the log-likelihood will rise, but only up to a <b>certain point</b>. Watching for that point can help you find a suitable number of iterations.
{% for table in parameter %}
{{ table|safe }}
{% endfor %}
<br>
As you can see, your corpus is much smaller after cleaning. You {{ cleaning|safe }}. In addition, so-called <i>hapax legomena</i> have been removed. In corpus linguistics, a hapax legomenon is a word that occurs only once within a context. If a word occurs only once in a document, it is very likely to be semantically insignificant – that is, not useful for the model.
<br>
<br>
<center>{{ corpus_boxplot_div|safe }}</center>
</p>
<br>
<div class="alert alert-success">
<button type="button" class="close" data-dismiss="alert">×</button>
<b>FYI:</b> All tables and graphics shown here are available for <b>download as a ZIP archive</b>. Just use the button in the toolbar.
</div>
<h2>2. Inspecting the Topic Model</h2>
<p>Topic modeling is an unsupervised method. It is called <i>unsupervised</i> because it does not rely on labels describing the semantic structures of your documents, only on word frequencies. Therefore, there is no automatic evaluation of how <i>good</i> the topics are. It is up to you to inspect the model and decide whether you are satisfied with its performance.
<div class="alert alert-info">
<button type="button" class="close" data-dismiss="alert">×</button>
<b>Tip:</b> Evaluating topics (that is, lists of words as seen below) quantitatively is a challenging task. <b>Pointwise Mutual Information</b> (PMI) is one way to evaluate the semantic coherence of topics. We implemented two variants of PMI in Python, which are available via GitHub.
</div>
</p>
<h3>2.1. Topics</h3>
<p>Each topic is a probability distribution over the vocabulary found in the corpus. The top words (so-called <i>keys</i>) shown in the table below are those words <b>most likely to be found in each topic</b>, and describe the semantic structures of your corpus – ideally in a meaningful way. Lists of the top keys associated with each topic are often all that is needed when the corpus is large and the inferred topics make sense in light of prior knowledge of the corpus.</p>
<br>
{% for table in topics %}
{{ table|safe }}
{% endfor %}
<br>
<h3>2.2. Topics and Documents</h3>
<p>Each document <i>consists</i> to a certain extent of each topic (this is the theoretical assumption of topic models). The proportions can be visualized in a heatmap. This displays the kind of information that is <b>probably most useful to literary scholars</b>. Going beyond pure exploration, this visualization can be used to show <b>thematic developments</b> over a set of texts as well as within a single text, akin to a dynamic topic model. It can also become apparent here that some topics correlate highly with a <b>specific author or group of authors</b>, while other topics correlate highly with a <b>specific text or group of texts</b>. All in all, this illustrates two of LDA’s properties – its use as a <b>distant reading tool</b> that aims to get at text meaning, and its use as a provider of data that can be further used in computational analysis, such as document classification or authorship attribution.</p>
{{ heatmap_div|safe }}
<br>
<br>
<h3>2.3. Distribution of Topics</h3>
<p>The following graphic shows <i>one</i> dimension of the information displayed in the heatmap above. This can be a clearer view if you are interested in a specific topic or, more precisely, in how that topic is distributed over the documents of your corpus. You can use the widget to select a specific topic.
</p>
<div class="alert alert-success">
<button type="button" class="close" data-dismiss="alert">×</button>
<b>FYI:</b> The proportions you can see here by default are those of the first topic: <b>{{ first_topic|safe }}</b>. But you can of course take a closer look at each topic by using the widget.
</div>
{% if autocomplete_warning_t|safe == "include" %}
<div class="alert alert-danger">
<button type="button" class="close" data-dismiss="alert">×</button>
<b>Watch out!</b> The autocompletion is still a bit buggy and <b>serves more as a writing aid</b>. Clicking on a suggestion does not do much yet. The text field must contain the complete name and <b>you have to press enter</b>, otherwise it will not work. Sorry, we're working on it.
</div>
{% endif %}
<div class="bars">
{{ topics_div|safe }}
<br>
</div>
<h3>2.4. Distribution of Documents</h3>
<p>Similar to the bar chart above, this graphic shows the <i>other</i> dimension displayed in the heatmap. If you are interested in a specific <i>document</i>, you can select it via the widget and inspect its proportions.
</p>
<div class="alert alert-success">
<button type="button" class="close" data-dismiss="alert">×</button>
<b>FYI:</b> The proportions you can see here by default are those of the first document: <b>{{ first_document|safe }}</b>. Here you can also have a closer look at the distribution of the topics for each document using the widget.
</div>
{% if autocomplete_warning_d|safe == "include" %}
<div class="alert alert-danger">
<button type="button" class="close" data-dismiss="alert">×</button>
<b>Watch out!</b> The autocompletion is still a bit buggy and <b>serves more as a writing aid</b>. Clicking on a suggestion does not do much yet. The text field must contain the complete name and <b>you have to press enter</b>, otherwise it will not work. Sorry, we're working on it.
</div>
{% endif %}
<div class="bar">
{{ documents_div|safe }}
<br>
</div>
<h2>3. Delving Deeper into Topic Modeling</h2>
<p>We want to introduce users with little or no programming experience to digital methods. If this brief insight into topic modeling as a text mining technique has aroused your interest, and you want to delve deeper into the <b>technical parts</b>, we provide the same convenient, modular workflow, which can be entirely controlled from within a well-documented Jupyter notebook and integrates a total of three popular LDA implementations.</p>
<p>All resources are available via GitHub. To prevent dead links in this application, it is probably safer if you search the internet for the GitHub repository yourself. The name of the organization is <b>DARIAH-DE</b>, and the name of the repository is <b>Topics</b>.</p>
</div>
{% endblock %}