Update documentation
severinsimmler committed Apr 12, 2018
1 parent d36138e commit 7d72efe
Showing 1 changed file with 3 additions and 2 deletions.
5 changes: 3 additions & 2 deletions templates/index.html
@@ -33,7 +33,7 @@ <h3>1.1. Reading a Corpus of Documents</h3>
</p>
<input type="file" name="files" accept=".txt, .xml" multiple required/><br><br>
<h3>1.2. Tokenization</h3>
-<p>An important preprocessing step is tokenization. Without identifying tokens, it is difficult to extract necessary information, such as token frequencies in general, or <b>most frequent tokens</b>, also known as <i>stopwords</i>. In this application,
+<p>An important preprocessing step is tokenization. Without identifying tokens, it is difficult to extract necessary information, such as token frequencies in general, or <b>most frequent words</b>, also known as <i>stopwords</i>. In this application,
one token consists of one or more characters, optionally followed by exactly one punctuation (a hyphen or something related), followed by one or more characters. For example, the phrase “her father's arm-chair” will be tokenized as <code>["her", "father's", "arm-chair"]</code>.</p>
<h3>1.3. Cleaning the Corpus</h3>
<p>Stopwords are harmful for LDA and have to be removed from the corpus. In case you want to <b>determine stopwords individually</b> based on your corpus, define a threshold for most frequent words in the line below.</p>
@@ -56,7 +56,7 @@ <h2>2. Modeling</h2>
<br>
<h2>3. Visualizing</h2>
<p>When using LDA to explore text collections, we are typically interested in examining texts in terms of their <b>constituent topics</b> (instead of word frequencies). Because the number of topics is so much smaller than the number of unique vocabulary
-elements (say, 10 versus 10,000), a range of data visualization methods become available. As you will see, all of the provided visualizations are <b>interactive</b>, but you will have the ability to save the plots as a <b>static image file</b>.</p>
+elements (say, 10 versus 10,000), a range of data visualization methods become available. As you will see, all of the provided visualizations are <b>interactive</b>.</p>
<br>
<div class="center_button">
<button class="button" type="submit"><b>Train<br>Topic Model</b></button>
@@ -65,3 +65,4 @@ <h2>3. Visualizing</h2>
</form>
</div>
{% endblock %}
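
For reference, the tokenization rule and the stopword threshold described in sections 1.2 and 1.3 of the template can be sketched in Python. This is a minimal illustration and not the application's actual code: the pattern, the punctuation set (hyphen and apostrophes, as a guess at "a hyphen or something related") and the function names `tokenize`, `most_frequent_tokens` and `remove_stopwords` are all assumptions made for this example.

```python
import re
from collections import Counter

# Assumed pattern: one or more letters, at most one punctuation mark,
# then one or more letters. The punctuation set is a guess.
TOKEN = re.compile(r"[^\W\d_]+[-'’]?[^\W\d_]+")


def tokenize(text):
    """Return all tokens in `text` that match the assumed pattern."""
    return TOKEN.findall(text)


def most_frequent_tokens(tokens, threshold):
    """Return the `threshold` most frequent tokens, most frequent first."""
    counts = Counter(tokens)
    return [token for token, _ in counts.most_common(threshold)]


def remove_stopwords(tokens, stopwords):
    """Drop every token that appears in `stopwords`."""
    stop = set(stopwords)
    return [token for token in tokens if token not in stop]


if __name__ == "__main__":
    print(tokenize("her father's arm-chair"))
    tokens = tokenize("the cat and the dog and the bird")
    stopwords = most_frequent_tokens(tokens, threshold=2)
    print(remove_stopwords(tokens, stopwords))
```

Read literally, the rule requires at least two letters per token, so under this sketch single-letter words such as "a" are dropped. Whether the application behaves the same way is not stated in the template text.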
