Bugfix
severinsimmler committed May 31, 2018
1 parent 447327d commit af72bef
Showing 2 changed files with 9 additions and 6 deletions.
11 changes: 7 additions & 4 deletions application/modeling.py
@@ -75,10 +75,13 @@ def workflow(tempdir, archive_dir):
         tokenized_corpus[filename.stem] = tokens
         parameter["Corpus size (raw), in tokens"] += len(tokens)

-        text = text.strip()
-        token_int = random.randint(0, len(text) - 251)
+        text = text.replace("\n", " ")
+        text = text.replace("\r", " ")
+        text = text.replace("\'", "")
+        text = text.replace("\"", "")
+        token_int = random.randint(0, len(text) - 351)
         try:
-            excerpt = "...{}...".format(text[token_int:token_int + 250])
+            excerpt = "...{}...".format(text[token_int:token_int + 350])
         except IndexError:
             excerpt = ""

@@ -103,7 +106,7 @@ def workflow(tempdir, archive_dir):
             stopwords = dariah_topics.preprocessing.find_stopwords(document_term_matrix, user_input["mfw"])
             cleaning = "removed the <b>{0} most frequent words</b>, based on a threshold value".format(user_input["mfw"])
         except KeyError:
-            yield "running", "Reading external stopwords list ...", progress / complete * 100, "", "", "", "", ""
+            yield "running", "Reading external stopwords list ...", progress / complete * 100, "", corpus_size, token_size, topic_size, iteration_size
             stopwords = user_input["stopwords"].read().decode("utf-8")
             stopwords = list(dariah_topics.preprocessing.tokenize(stopwords))
             cleaning = "removed <b>{0} words</b>, based on an external stopwords list".format(len(stopwords))
4 changes: 2 additions & 2 deletions application/templates/index.html
@@ -20,7 +20,7 @@ <h1>Topics – Easy Topic Modeling</h1>
 </div>
 <h2>1. Preprocessing</h2>
 <p>Your text corpus is tokenized first. This splits a text into individual words, so-called <i>tokens</i>. Token frequencies are typical units of analysis when working with text corpora. It may come as a surprise that reducing a book to a list of token frequencies retains useful information, but practice has shown this to be the case. Usually the most common tokens in a document are <b>semantically insignificant words</b> like <i>the</i> or <i>and</i>. Because you are trying to uncover hidden semantic structures of a text collection, you have to get rid of these words before modeling. This is done during preprocessing.</p>
-<h3>1.1. Reading a Corpus of Documents</h3>
+<h3>1.1. Reading a corpus of documents</h3>
 <p>For this workflow you need a text corpus (i.e. a collection of text files) as plain text (<b>.txt</b>) or as XML (<b>.xml</b>). Use the button below to select the files. To guarantee usable results, select <b>at least five documents</b> (but the more, the better).
 <div class="alert alert-info">
 <button type="button" class="close" data-dismiss="alert">&times;</button>
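
To illustrate the frequency argument in the preprocessing paragraph above: counting tokens over a corpus almost always puts function words at the top of the list. A rough, stand-alone sketch, not part of this commit; the corpus folder and the whitespace tokenization are placeholders:

    from collections import Counter
    from pathlib import Path

    frequencies = Counter()
    for path in Path("corpus").glob("*.txt"):  # hypothetical corpus folder
        # Crude whitespace tokenization, just for illustration.
        tokens = path.read_text(encoding="utf-8").lower().split()
        frequencies.update(tokens)

    # The top of the list is typically dominated by words like "the" and "and".
    print(frequencies.most_common(10))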
@@ -32,7 +32,7 @@ <h3>1.1. Reading a Corpus of Documents</h3>
 <br>
 <h3>1.2. Tokenization</h3>
 <p>An important preprocessing step is tokenization. Without identifying words as separate units, it is impossible to determine necessary information such as word frequencies. In this application, a token consists of one or more characters, optionally followed by exactly one punctuation mark (e.g. a hyphen), followed by one or more characters. For example, the phrase “her father's arm-chair” is tokenized as <code>["her", "father's", "arm-chair"]</code>.</p>
-<h3>1.3. Cleaning the Corpus</h3>
+<h3>1.3. Cleaning the corpus</h3>
 <p>The most common words in a document are also called stopwords. As described above, stopwords are harmful to the LDA model and must be removed before modeling. If you want to determine and remove the stopwords individually based on your corpus, you can define a threshold value here for the <i>n</i> most common words to be deleted.</p>
 <div class="alert alert-info">
 <button type="button" class="close" data-dismiss="alert">&times;</button>
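
The tokenization rule described in section 1.2 can be approximated with a single regular expression. This is a sketch of the described behaviour, not necessarily the exact pattern used by dariah_topics.preprocessing.tokenize:

    import re

    # One or more word characters, optionally exactly one punctuation mark
    # followed again by one or more word characters.
    TOKEN = re.compile(r"\w+(?:[^\w\s]\w+)?")

    def tokenize(text):
        return TOKEN.findall(text)

    print(tokenize("her father's arm-chair"))
    # ['her', "father's", 'arm-chair']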
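
The threshold-based cleaning in section 1.3 boils down to counting tokens across all documents and dropping the n most frequent ones. A self-contained sketch of that idea; the application itself calls dariah_topics.preprocessing.find_stopwords, as shown in the modeling.py hunk above, so the counting here is only a stand-in:

    from collections import Counter

    def find_most_frequent(tokenized_corpus, mfw):
        """Return the mfw most frequent tokens across all documents."""
        counts = Counter()
        for tokens in tokenized_corpus.values():
            counts.update(tokens)
        return {token for token, _ in counts.most_common(mfw)}

    def remove_stopwords(tokenized_corpus, stopwords):
        """Drop the stopwords from every document."""
        return {name: [token for token in tokens if token not in stopwords]
                for name, tokens in tokenized_corpus.items()}

    corpus = {"doc1": ["the", "cat", "and", "the", "dog"],
              "doc2": ["the", "dog", "and", "the", "bird"]}
    stopwords = find_most_frequent(corpus, mfw=2)  # {"the", "and"}
    print(remove_stopwords(corpus, stopwords))
    # {'doc1': ['cat', 'dog'], 'doc2': ['dog', 'bird']}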
