Bugfix
severinsimmler committed May 31, 2018
1 parent 447327d commit af72bef
Showing 2 changed files with 9 additions and 6 deletions.
11 changes: 7 additions & 4 deletions application/modeling.py
@@ -75,10 +75,13 @@ def workflow(tempdir, archive_dir):
         tokenized_corpus[filename.stem] = tokens
         parameter["Corpus size (raw), in tokens"] += len(tokens)

-        text = text.strip()
-        token_int = random.randint(0, len(text) - 251)
+        text = text.replace("\n", " ")
+        text = text.replace("\r", " ")
+        text = text.replace("\'", "")
+        text = text.replace("\"", "")
+        token_int = random.randint(0, len(text) - 351)
         try:
-            excerpt = "...{}...".format(text[token_int:token_int + 250])
+            excerpt = "...{}...".format(text[token_int:token_int + 350])
         except IndexError:
             excerpt = ""

@@ -103,7 +106,7 @@ def workflow(tempdir, archive_dir):
             stopwords = dariah_topics.preprocessing.find_stopwords(document_term_matrix, user_input["mfw"])
             cleaning = "removed the <b>{0} most frequent words</b>, based on a threshold value".format(user_input["mfw"])
         except KeyError:
-            yield "running", "Reading external stopwords list ...", progress / complete * 100, "", "", "", "", ""
+            yield "running", "Reading external stopwords list ...", progress / complete * 100, "", corpus_size, token_size, topic_size, iteration_size
             stopwords = user_input["stopwords"].read().decode("utf-8")
             stopwords = list(dariah_topics.preprocessing.tokenize(stopwords))
             cleaning = "removed <b>{0} words</b>, based on an external stopwords list".format(len(stopwords))
4 changes: 2 additions & 2 deletions application/templates/index.html
@@ -20,7 +20,7 @@ <h1>Topics – Easy Topic Modeling</h1>
 </div>
 <h2>1. Preprocessing</h2>
 <p>Your text corpus is tokenized first. This splits a text into individual words, so-called <i>tokens</i>. Token frequencies are typical units of analysis when working with text corpora. It may come as a surprise that reducing a book to a list of token frequencies retains useful information, but practice has shown this to be the case. Usually the most common tokens in a document are <b>semantically insignificant words</b> like <i>the</i> or <i>and</i>. Because you are trying to uncover hidden semantic structures of a text collection, you have to get rid of these words before modeling. This is done during preprocessing.</p>
-<h3>1.1. Reading a Corpus of Documents</h3>
+<h3>1.1. Reading a corpus of documents</h3>
 <p>For this workflow you need a text corpus (i.e. a collection of text files) as plain text (<b>.txt</b>) or as XML (<b>.xml</b>). Use the button below to select the files. To guarantee usable results, select <b>at least five documents</b> (but the more, the better).
 <div class="alert alert-info">
 <button type="button" class="close" data-dismiss="alert">&times;</button>
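
To illustrate the frequency argument in the preprocessing paragraph above: counting tokens over a corpus almost always puts function words at the top of the list. A rough, stand-alone sketch, not part of this commit; the corpus folder and the whitespace tokenization are placeholders:

    from collections import Counter
    from pathlib import Path

    frequencies = Counter()
    for path in Path("corpus").glob("*.txt"):  # hypothetical corpus folder
        # Crude whitespace tokenization, just for illustration.
        tokens = path.read_text(encoding="utf-8").lower().split()
        frequencies.update(tokens)

    # The top of the list is typically dominated by words like "the" and "and".
    print(frequencies.most_common(10))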
@@ -32,7 +32,7 @@ <h3>1.1. Reading a Corpus of Documents</h3>
 <br>
 <h3>1.2. Tokenization</h3>
 <p>An important preprocessing step is tokenization. Without identifying words as separate units, it is impossible to determine necessary information such as word frequencies. In this application, a token consists of one or more characters, optionally followed by exactly one punctuation mark (e.g. a hyphen), followed by one or more characters. For example, the phrase “her father's arm-chair” is tokenized as <code>["her", "father's", "arm-chair"]</code>.</p>
-<h3>1.3. Cleaning the Corpus</h3>
+<h3>1.3. Cleaning the corpus</h3>
 <p>The most common words in a document are also called stopwords. As described above, stopwords are harmful to the LDA model and must be removed before modeling. If you want to determine and remove the stopwords individually based on your corpus, you can define a threshold value here for the <i>n</i> most common words to be deleted.</p>
 <div class="alert alert-info">
 <button type="button" class="close" data-dismiss="alert">&times;</button>
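
The tokenization rule described in section 1.2 can be approximated with a single regular expression. This is a sketch of the described behaviour, not necessarily the exact pattern used by dariah_topics.preprocessing.tokenize:

    import re

    # One or more word characters, optionally exactly one punctuation mark
    # followed again by one or more word characters.
    TOKEN = re.compile(r"\w+(?:[^\w\s]\w+)?")

    def tokenize(text):
        return TOKEN.findall(text)

    print(tokenize("her father's arm-chair"))
    # ['her', "father's", 'arm-chair']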
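
The threshold-based cleaning in section 1.3 boils down to counting tokens across all documents and dropping the n most frequent ones. A self-contained sketch of that idea; the application itself calls dariah_topics.preprocessing.find_stopwords, as shown in the modeling.py hunk above, so the counting here is only a stand-in:

    from collections import Counter

    def find_most_frequent(tokenized_corpus, mfw):
        """Return the mfw most frequent tokens across all documents."""
        counts = Counter()
        for tokens in tokenized_corpus.values():
            counts.update(tokens)
        return {token for token, _ in counts.most_common(mfw)}

    def remove_stopwords(tokenized_corpus, stopwords):
        """Drop the stopwords from every document."""
        return {name: [token for token in tokens if token not in stopwords]
                for name, tokens in tokenized_corpus.items()}

    corpus = {"doc1": ["the", "cat", "and", "the", "dog"],
              "doc2": ["the", "dog", "and", "the", "bird"]}
    stopwords = find_most_frequent(corpus, mfw=2)  # {"the", "and"}
    print(remove_stopwords(corpus, stopwords))
    # {'doc1': ['cat', 'dog'], 'doc2': ['dog', 'bird']}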
