Extend documentation
severinsimmler committed Jan 28, 2018
1 parent 2d63c47 commit de6827c
Showing 3 changed files with 17 additions and 16 deletions.
demonstrator/templates/index.html (2 changes: 1 addition & 1 deletion)
@@ -59,7 +59,7 @@ <h1>Topics – Easy Topic Modeling</h1>
<p>LDA, introduced in the context of text analysis in 2003, is an instance of a more general class of models called <b>mixed-membership models</b>. Involving a number of distributions and parameters, the model is typically performed using <b>Gibbs sampling</b> with conjugate priors and is purely based on word frequencies.</p>
<div class="alert alert-block">
<button type="button" class="close" data-dismiss="alert">&times;</button>
<i class="fa fa-exclamation-circle"></i> This application is designed to introduce the technique in a gentle way. If you have a <b>very large corpus</b> (let's say more than 200 documents with more than 5000 words per document), you may
<i class="fa fa-exclamation-circle"></i> This application is designed to introduce the technique in a gentle way and aims for simplicity. If you have a <b>very large corpus</b> (let's say more than 200 documents with more than 5000 words per document), you may
wish to use more sophisticated models such as those implemented in <b>MALLET</b>, which is known to be more robust than standard LDA. Have a look at our Jupyter notebook introducing <i>Topic Modeling with MALLET</i>, available via GitHub
(https://github.com/DARIAH-DE/Topics).
</div>
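
The paragraph above summarizes the model itself: LDA infers topics purely from word frequencies, typically via Gibbs sampling. As a rough, self-contained illustration of that workflow (not part of this repository), the following sketch builds a small document-term matrix by hand and fits it with the third-party lda package, one of several collapsed-Gibbs-sampling implementations; the toy corpus and all variable names are made up for the example.

    import numpy as np
    import lda  # third-party package implementing LDA with collapsed Gibbs sampling

    # Toy corpus, purely illustrative.
    docs = [['river', 'bank', 'water', 'flow'],
            ['money', 'bank', 'loan', 'interest'],
            ['water', 'stream', 'river', 'flow'],
            ['loan', 'money', 'interest', 'credit']]

    # Build a document-term matrix of raw word frequencies (documents x vocabulary).
    vocabulary = sorted({token for doc in docs for token in doc})
    token_index = {token: i for i, token in enumerate(vocabulary)}
    dtm = np.zeros((len(docs), len(vocabulary)), dtype=int)
    for d, doc in enumerate(docs):
        for token in doc:
            dtm[d, token_index[token]] += 1

    # Fit the mixed-membership model; n_iter is the number of Gibbs sampling iterations.
    model = lda.LDA(n_topics=2, n_iter=500, random_state=1)
    model.fit(dtm)

    # Each topic is a distribution over the vocabulary; show its most probable words.
    for t, word_dist in enumerate(model.topic_word_):
        top = [vocabulary[i] for i in word_dist.argsort()[::-1][:3]]
        print('Topic {}: {}'.format(t, ', '.join(top)))

In this sketch, model.doc_topic_ would hold the per-document topic proportions of the kind the result page visualizes.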
demonstrator/templates/result.html (4 changes: 2 additions & 2 deletions)
@@ -90,9 +90,9 @@ <h3>2.2. Topics and Documents</h3>
highly with a specific text or group of texts. All in all, this displays two of LDA's properties – its use as a distant reading tool that aims to get at text meaning, and its use as a provider of data that can be further used in computational
analysis, such as document classification or authorship attribution.</p><br> {{ heatmap_div|safe }}<br><br>
<h3>2.3. Topic Proportions of Documents</h3>
- <br>{{ topics_div|safe }}<br><br>
+ {{ topics_div|safe }}<br>
<h3>2.4. Document Proportions of Topics</h3>
- <br>{{ documents_div|safe }}<br><br><br>
+ {{ documents_div|safe }}<br>
<h2>2. Diving Deeper into Topic Modeling</h2>
<p>We want to empower users with little or no previous experience and programming skills to create custom workflows mostly using predefined functions within a familiar environment. So, if this practical introduction aroused your interest and
you want to <b>dive deeper into the technical parts</b>, we provide another convenient, modular workflow that can be entirely controlled from within a well documented Jupyter notebook, integrating a total of three popular LDA implementations.</p>
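
The section on topics and documents above notes that document-topic proportions can feed further computational analysis such as document classification or authorship attribution. Below is a hedged sketch of that second use, not taken from the repository: the doc_topics matrix and the author labels are random stand-ins for the heatmap data and for real metadata.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    # Stand-ins: 40 documents with 10 topic proportions each, and two known authors.
    rng = np.random.default_rng(0)
    doc_topics = rng.dirichlet(alpha=[0.5] * 10, size=40)
    authors = np.repeat(['author_a', 'author_b'], 20)

    # Use the topic proportions as features for authorship attribution.
    clf = LogisticRegression(max_iter=1000)
    scores = cross_val_score(clf, doc_topics, authors, cv=5)

    # With random toy data the score hovers around chance; only the shape of the
    # workflow matters here.
    print('Mean cross-validated accuracy:', round(scores.mean(), 2))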
demonstrator/webapp.py (27 changes: 14 additions & 13 deletions)
@@ -69,7 +69,7 @@ def process_xml(file):


def boxplot(stats):
- x_labels = ['Corpus (clean)', 'Corpus (raw)']
+ x_labels = ['Document size (clean)', 'Document size (raw)']

groups = stats.groupby('group')
q1 = groups.quantile(q=0.25)
@@ -102,6 +102,7 @@ def outliers(group):
fig.rect(x_labels, lower.score, 0.2, 0.01, line_color='black')
fig.rect(x_labels, upper.score, 0.2, 0.01, line_color='black')

+ fig.yaxis.axis_label = 'Tokens'
fig.xgrid.grid_line_color = None
fig.ygrid.grid_line_color = 'white'
fig.grid.grid_line_width = 2
@@ -197,7 +198,7 @@ def modeling():
document_labels = tokenized_corpus.index
document_term_matrix = preprocessing.create_document_term_matrix(tokenized_corpus, document_labels)
stats = pd.DataFrame({'score': np.array(document_term_matrix.sum(axis=1)),
- 'group': ['Corpus (raw)' for x in range(len(tokenized_corpus))]})
+ 'group': ['Document size (raw)' for x in range(len(tokenized_corpus))]})

if request.files.get('stopword_list', None):
log.info("Accessing external stopwords list ...")
@@ -215,7 +216,7 @@
features = [token for token in features if token in document_term_matrix.columns]
document_term_matrix = document_term_matrix.drop(features, axis=1)
stats = stats.append(pd.DataFrame({'score': np.array(document_term_matrix.sum(axis=1)),
- 'group': ['Corpus (cleaned)' for x in range(len(tokenized_corpus))]}))
+ 'group': ['Document size (cleaned)' for x in range(len(tokenized_corpus))]}))
parameter.append(int(document_term_matrix.values.sum()))
document_term_arr = document_term_matrix.as_matrix().astype(int)
log.info("Accessing corpus vocabulary ...")
@@ -240,15 +241,15 @@
log.info("Creating interactive heatmap ...")
if document_topics.shape[0] < document_topics.shape[1]:
if document_topics.shape[1] < 20:
- height = 20 * 25
+ height = 20 * 28
else:
- height = document_topics.shape[1] * 25
+ height = document_topics.shape[1] * 28
document_topics_heatmap = document_topics.T # todo: Fix hover when transposed
else:
if document_topics.shape[0] < 20:
- height = 20 * 25
+ height = 20 * 28
else:
- height = document_topics.shape[0] * 25
+ height = document_topics.shape[0] * 28
document_topics_heatmap = document_topics
fig = visualization.PlotDocumentTopics(document_topics_heatmap,
enable_notebook=False)
@@ -261,25 +262,25 @@

log.info("Creating interactive barcharts ...")
if document_topics.shape[1] < 10:
- height = 10 * 15
+ height = 10 * 18
else:
- height = document_topics.shape[1] * 15
+ height = document_topics.shape[1] * 18
topics_barchart = barchart(document_topics, height=height)
topics_script, topics_div = components(topics_barchart)

if document_topics.shape[0] < 10:
- height = 10 * 15
+ height = 10 * 18
else:
- height = document_topics.shape[1] * 15
+ height = document_topics.shape[0] * 18
documents_barchart = barchart(document_topics.T, height=height, topics=False)
documents_script, documents_div = components(documents_barchart)

js_resources = INLINE.render_js()
css_resources = INLINE.render_css()
end = time.time()
passed_time = round((end - start) / 60)
- index = ['Corpus size in documents', 'Corpus size in tokens', 'Corpus size in tokens (cleaned)',
-          'Size of vocabulary (cleaned)', 'Number of topics', 'Number of iterations', 'The model\'s log likelihood']
+ index = ['Corpus size in documents', 'Corpus size in tokens (raw)', 'Corpus size in tokens (clean)',
+          'Size of vocabulary (clean)', 'Number of topics', 'Number of iterations', 'The model\'s log likelihood']
if passed_time == 0:
index.append('Passed time in seconds')
parameter.append(round(end - start))
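
Several of the hunks above adjust the same sizing heuristic: each heatmap or barchart row gets a fixed number of pixels (raised from 25 to 28 for the heatmap, 15 to 18 for the barcharts), with a floor so very small corpora still render legibly. A small refactoring sketch of that pattern, not part of the commit:

    def plot_height(n_rows, row_px=28, min_rows=20):
        """Figure height in pixels: row_px per row, but never fewer than min_rows rows."""
        return max(n_rows, min_rows) * row_px

    # Equivalent to the branching in the diff, e.g. for the heatmap:
    #   height = 20 * 28 if document_topics.shape[1] < 20 else document_topics.shape[1] * 28
    # becomes plot_height(document_topics.shape[1]); the barcharts would use
    # plot_height(..., row_px=18, min_rows=10).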
