Commit

feat: new content
severinsimmler committed Nov 25, 2018
1 parent 0069a01 commit 44aa518
Showing 2 changed files with 59 additions and 61 deletions.
Binary file removed assets/screenshot.png
Binary file not shown.
120 changes: 59 additions & 61 deletions index.html
@@ -372,7 +372,7 @@ <h2 id="getting started">Getting started</h2>
<li>Run the app by double-clicking the file
<strong>DARIAH Topics Explorer</strong>.</li>
</ol>
<p>You can also use the developer version (if you are not on Windows):</p>
<p>You can also use the most recent source code (if you are not on Windows):</p>
<ol class="arabic simple">
<li>Go to the
<a class="reference external" href="https://github.com/DARIAH-DE/TopicsExplorer/releases/latest">release-section</a>
@@ -389,7 +389,6 @@ <h2 id="getting started">Getting started</h2>
<li>To start the application, type
<code>python topicsexplorer.py</code>, and press enter.</li>
</ol>
<br>
<div>
<h4 class="highlight center" style="padding-bottom: 0px; text-align: left;">Note</h4>
<p class="highlight center" style="text-align: left;">If you want to use the sample corpus, you
@@ -402,34 +401,17 @@ <h4 class="highlight center" style="padding-bottom: 0px; text-align: left;">Note
release section,
the corpus is included.</p>
</div>

<br />

<h2 id="application">The application</h2>
<img style="width: 800px; margin: 0 auto;" alt="Screenshot of DARIAH Topics Explorer" src="assets/screenshot.png" />
<p>Topics Explorer aims for
<strong>simplicity and usability</strong>. If you are working with a large corpus (let’s say more
than 200
documents, 5000
words each document), you may wish to use more sophisticated topic models such as those implemented
in
<a class="reference external" href="http://mallet.cs.umass.edu/topics.php">MALLET</a>, which is
known to be more
robust than standard LDA. Have a look at our Jupyter notebook introducing
<a class="reference external" href="https://github.com/DARIAH-DE/Topics/blob/master/notebooks/IntroducingMallet.ipynb">topic
modeling with MALLET</a>.</p>


<h2 id="visualization">Example visualization</h2>
<p>The following visualization is based on the distribution of 10 topics over a total of 10 novels
(written by
Charles
Dickens, George Eliot, Henry Fielding, William Thackeray, and Anthony Trollope). But first of all,
the
algorithm
produces so-called topics:</p>
<p>
<center>
<img style="width: 800px; margin: 0 auto;" alt="Screenshot of DARIAH Topics Explorer" src="assets/application-screenshot.png" />
<p>This application is designed to introduce topic modeling particularly gently (e.g. for educational purposes).
If you have a very large text corpus, you may wish to use more <i>powerful</i> tools like <a href="http://mallet.cs.umass.edu/topics.php">MALLET</a>, which is written in Java and can be completely controlled from the command-line.
The topic modeling algorithm used in this application, <i>latent Dirichlet allocation</i>, was implemented
by <a href="https://www.ariddell.org">Allen B. Riddell</a> using collapsed Gibbs sampling as described in <a href="http://www.genetics.org/content/155/2/945.full">Pritchard et al. (2000)</a>.</p>
<p>You might want to check out some <a href="https://github.com/DARIAH-DE/Topics/tree/master/notebooks">Jupyter notebooks</a> for topic modeling in Python – experimenting with an example corpus on
<a href="https://mybinder.org/v2/gh/DARIAH-DE/Topics/master?filepath=notebooks%2FIntroducingLda.ipynb">Binder</a> does not require any software on your local machine.</p>

<h2>The sample corpus</h2>
<p>We provide a small sample corpus with which the application can be tested quickly and easily:
<table border="0" class="dataframe">
<thead>
<tr style="text-align: right;">
@@ -524,12 +506,27 @@ <h2 id="visualization">Example visualization</h2>
</tr>
</tbody>
</table>
</center>


</p>

<h2 id="visualization">Example visualizations</h2>
<p>The following visualizations display the topic model output of 10 novels (written by Charles Dickens, George Eliot, Henry Fielding, William Thackeray and Anthony Trollope).</p>

<div style="margin-top: 25px; margin-bottom: 30px;">
<p class="highlight center" style="text-align: left;"><b>Topics Explorer’s visualizations are interactive</b>. You will be able to navigate through topics and documents, get similar topics and documents displayed, read excerpts from the original texts, and inspect the document-topic distributions in a heatmap.</p>
</div>


<p>Topics are probability distributions over the whole vocabulary of a text corpus. One value is assigned to each word, which indicates how relevant the word is to that topic (to be exact, how likely one word is to be found in a topic). After sorting those values in descending order, the first n words represent a topic.</p>
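<p>The sorting step described above can be sketched in a few lines of Python. This is only an illustration, not the application's own code: the function name <code>top_words</code>, the six-word vocabulary, and the probabilities are all made up for the example.</p>

```python
def top_words(topic, n=3):
    """Return the n most probable words of one topic.

    `topic` maps each vocabulary word to its probability in that topic.
    """
    # Sort by probability, highest first; ties are broken alphabetically.
    ranked = sorted(topic.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:n]]


# Hypothetical topic over a six-word vocabulary (probabilities sum to 1).
topic = {
    "ship": 0.30,
    "sea": 0.25,
    "captain": 0.20,
    "letter": 0.15,
    "garden": 0.07,
    "tea": 0.03,
}

print(top_words(topic, n=3))  # ['ship', 'sea', 'captain']
```

<p>A real topic covers the whole vocabulary of the corpus, so most of its probabilities are very small; only the first few words after sorting are usually shown.</p>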

<p>Below, the topics are ranked by their numerical dominance in the sample corpus; each bar displays a topic’s dominance score.</p>
<p>
THE BAR CHART GOES HERE
</p>
<p>These topics describe the semantic structures of a text corpus. Every document of the corpus
consists, to a
certain
degree, of every topic.</p>
<p>Each document consists to a certain extent of each topic, which is one of the theoretical assumptions of topic models. Although some values are too small to be visualized here (and have therefore been rounded to zero), they are actually greater than zero.</p>
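<p>One way to see why no proportion is exactly zero is the following minimal sketch. It assumes that the proportions are computed from per-document topic counts smoothed with a symmetric Dirichlet prior <code>alpha</code>, as is common for LDA with Gibbs sampling; the function name, the counts, and the value of <code>alpha</code> are hypothetical and not taken from the application.</p>

```python
def document_topic_proportions(counts, alpha=0.1):
    """Turn per-document topic counts into smoothed proportions.

    counts[d][k] is the number of words in document d assigned to topic k.
    alpha is an assumed symmetric Dirichlet prior; because alpha > 0,
    every proportion stays strictly positive.
    """
    proportions = []
    for row in counts:
        # Each topic receives alpha pseudo-counts on top of its real count.
        total = sum(row) + alpha * len(row)
        proportions.append([(count + alpha) / total for count in row])
    return proportions


# Hypothetical counts for two documents and three topics.
counts = [
    [9990, 10, 0],
    [0, 2500, 7500],
]

theta = document_topic_proportions(counts)
for row in theta:
    # Rounded for display: some entries show as 0.0 although they are positive.
    print([round(p, 3) for p in row])
```

<p>Each row sums to one, and a topic with a count of zero still receives a tiny positive share, which disappears only when the value is rounded for display.</p>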
<p>Visualizing the document-topic proportions in a heatmap displays the kind of information that is probably most useful. Going beyond pure exploration, it can be used to show thematic developments over a set of texts, akin to a dynamic topic model.</p>
<p>THE HEATMAP GOES HERE</p>
<h2 id="troubleshooting">Troubleshooting</h2>
<ul class="simple">
<li>If you are confronted with
@@ -574,33 +571,34 @@ <h2 id="developing">Developing</h2>
please check out the
<a class="reference external" href="https://github.com/DARIAH-DE/TopicsExplorer#developing">GitHub</a>
page.</p>

<h2 id="about dariah">About DARIAH-DE</h2>
<p>
<a class="reference external" href="https://de.dariah.eu/">DARIAH-DE</a> supports research in the
humanities and
cultural sciences with digital methods and procedures. The research
infrastructure of DARIAH-DE consists of four pillars: teaching, research, research data and
technical
components.
As a partner in
<a class="reference external" href="http://dariah.eu/">DARIAH-EU</a>, DARIAH-DE helps to bundle and
network
state-of-the-art activities of the digital humanities. Scientists
use DARIAH, for example, to make research data available across Europe. The exchange of knowledge
and expertise
is
thus promoted across disciplines and the possibility of discovering new scientific discourses is
encouraged.</p>
<p>This application has been developed with support from the DARIAH-DE initiative, the German branch of
DARIAH-EU, the
European Digital Research Infrastructure for the Arts and Humanities consortium. Funding has been
provided by
the
German Federal Ministry for Research and Education (BMBF) under the identifier 01UG1610J.
</p>
</p>

<h2>What is topic modeling?</h2>
<ul>
<li><b>David M. Blei</b>, <a href="http://www.cs.columbia.edu/~blei/papers/Blei2012.pdf">Probabilistic
Topic Models</a>, in: <i>Communications of the ACM</i> 55 (2012).</li>
<li><b>Megan R. Brett</b>, <a href="http://journalofdigitalhumanities.org/2-1/topic-modeling-a-basic-introduction-by-megan-r-brett/">Topic Modeling, A Basic Introduction</a>, in: <i>Journal of Digital Humanities</i> 2 (2012).</li>
<li><b>Matthew Jockers and David Mimno</b>, <a href="http://digitalcommons.unl.edu/englishfacpubs/105">Significant
Themes in 19th-Century Literature</a>, in: <i>Poetics</i> 41 (2013).</li>
<li><b>Steffen Pielström, Severin Simmler, Thorsten Vitt and Fotis Jannidis</b>, <a href="https://dh2018.adho.org/a-graphical-user-interface-for-lda-topic-modeling/">A
Graphical User Interface for LDA Topic Modeling</a>, in: <i>Proceedings of the 28th Digital
Humanities Conference</i> (2018).</li>
</ul>
<h2>What is DARIAH-DE?</h2>
<p><a href="https://de.dariah.eu/en">DARIAH-DE</a> supports research in the humanities and cultural sciences with
digital methods and procedures. The research infrastructure of DARIAH-DE consists of four pillars:
teaching, research, research data and technical components. As a partner in <a href="https://www.dariah.eu/">DARIAH-EU</a>,
DARIAH-DE helps to bundle and network state-of-the-art activities of the digital humanities. Scientists use
DARIAH, for example, to make research data available across Europe. The exchange of knowledge and expertise
is thus promoted across disciplines and the possibility of discovering new scientific discourses is
encouraged.</p>
<p>This application is developed with support from the DARIAH-DE initiative, the German branch of DARIAH-EU,
the European Digital Research Infrastructure for the Arts and Humanities consortium. Funding has been
provided by the German Federal Ministry for Research and Education (BMBF) under the identifier 01UG1610A to
J.</p>
<h2>License</h2>
<p>This application is licensed under <a href="https://www.apache.org/licenses/LICENSE-2.0">Apache 2.0</a>. You can
do what you like with the source code, as long as you include the original copyright notice and the full text of the
Apache 2.0 license, and state significant changes. You cannot hold DARIAH-DE liable for damages, or use any of
its trademarks, such as its name or logos.</p>
</div>
</main>
</div>