Extend documentation
severinsimmler committed Jan 28, 2018
1 parent 2d63c47 commit de6827c
Showing 3 changed files with 17 additions and 16 deletions.
demonstrator/templates/index.html (2 changes: 1 addition & 1 deletion)
@@ -59,7 +59,7 @@ <h1>Topics – Easy Topic Modeling</h1>
<p>LDA, introduced in the context of text analysis in 2003, is an instance of a more general class of models called <b>mixed-membership models</b>. Involving a number of distributions and parameters, the model is typically performed using <b>Gibbs sampling</b> with conjugate priors and is purely based on word frequencies.</p>
<div class="alert alert-block">
<button type="button" class="close" data-dismiss="alert">&times;</button>
<i class="fa fa-exclamation-circle"></i> This application is designed to introduce the technique in a gentle way. If you have a <b>very large corpus</b> (let's say more than 200 documents with more than 5000 words per document), you may
<i class="fa fa-exclamation-circle"></i> This application is designed to introduce the technique in a gentle way and aims for simplicity. If you have a <b>very large corpus</b> (let's say more than 200 documents with more than 5000 words per document), you may
wish to use more sophisticated models such as those implemented in <b>MALLET</b>, which is known to be more robust than standard LDA. Have a look at our Jupyter notebook introducing <i>Topic Modeling with MALLET</i>, available via GitHub
(https://github.com/DARIAH-DE/Topics).
</div>
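
The paragraph above summarizes the model itself: LDA infers topics purely from word frequencies, typically via Gibbs sampling. As a rough, self-contained illustration of that workflow (not part of this repository), the following sketch builds a small document-term matrix by hand and fits it with the third-party lda package, one of several collapsed-Gibbs-sampling implementations; the toy corpus and all variable names are made up for the example.

    import numpy as np
    import lda  # third-party package implementing LDA with collapsed Gibbs sampling

    # Toy corpus, purely illustrative.
    docs = [['river', 'bank', 'water', 'flow'],
            ['money', 'bank', 'loan', 'interest'],
            ['water', 'stream', 'river', 'flow'],
            ['loan', 'money', 'interest', 'credit']]

    # Build a document-term matrix of raw word frequencies (documents x vocabulary).
    vocabulary = sorted({token for doc in docs for token in doc})
    token_index = {token: i for i, token in enumerate(vocabulary)}
    dtm = np.zeros((len(docs), len(vocabulary)), dtype=int)
    for d, doc in enumerate(docs):
        for token in doc:
            dtm[d, token_index[token]] += 1

    # Fit the mixed-membership model; n_iter is the number of Gibbs sampling iterations.
    model = lda.LDA(n_topics=2, n_iter=500, random_state=1)
    model.fit(dtm)

    # Each topic is a distribution over the vocabulary; show its most probable words.
    for t, word_dist in enumerate(model.topic_word_):
        top = [vocabulary[i] for i in word_dist.argsort()[::-1][:3]]
        print('Topic {}: {}'.format(t, ', '.join(top)))

In this sketch, model.doc_topic_ would hold the per-document topic proportions of the kind the result page visualizes.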
demonstrator/templates/result.html (4 changes: 2 additions & 2 deletions)
@@ -90,9 +90,9 @@ <h3>2.2. Topics and Documents</h3>
highly with a specific text or group of texts. All in all, this displays two of LDA's properties – its use as a distant reading tool that aims to get at text meaning, and its use as a provider of data that can be further used in computational
analysis, such as document classification or authorship attribution.</p><br> {{ heatmap_div|safe }}<br><br>
<h3>2.3. Topic Proportions of Documents</h3>
- <br>{{ topics_div|safe }}<br><br>
+ {{ topics_div|safe }}<br>
<h3>2.4. Document Proportions of Topics</h3>
- <br>{{ documents_div|safe }}<br><br><br>
+ {{ documents_div|safe }}<br>
<h2>2. Diving Deeper into Topic Modeling</h2>
<p>We want to empower users with little or no previous experience and programming skills to create custom workflows mostly using predefined functions within a familiar environment. So, if this practical introduction aroused your interest and
you want to <b>dive deeper into the technical parts</b>, we provide another convenient, modular workflow that can be entirely controlled from within a well documented Jupyter notebook, integrating a total of three popular LDA implementations.</p>
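
The section on topics and documents above notes that document-topic proportions can feed further computational analysis such as document classification or authorship attribution. Below is a hedged sketch of that second use, not taken from the repository: the doc_topics matrix and the author labels are random stand-ins for the heatmap data and for real metadata.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    # Stand-ins: 40 documents with 10 topic proportions each, and two known authors.
    rng = np.random.default_rng(0)
    doc_topics = rng.dirichlet(alpha=[0.5] * 10, size=40)
    authors = np.repeat(['author_a', 'author_b'], 20)

    # Use the topic proportions as features for authorship attribution.
    clf = LogisticRegression(max_iter=1000)
    scores = cross_val_score(clf, doc_topics, authors, cv=5)

    # With random toy data the score hovers around chance; only the shape of the
    # workflow matters here.
    print('Mean cross-validated accuracy:', round(scores.mean(), 2))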
demonstrator/webapp.py (27 changes: 14 additions & 13 deletions)
@@ -69,7 +69,7 @@ def process_xml(file):


def boxplot(stats):
- x_labels = ['Corpus (clean)', 'Corpus (raw)']
+ x_labels = ['Document size (clean)', 'Document size (raw)']

groups = stats.groupby('group')
q1 = groups.quantile(q=0.25)
@@ -102,6 +102,7 @@ def outliers(group):
fig.rect(x_labels, lower.score, 0.2, 0.01, line_color='black')
fig.rect(x_labels, upper.score, 0.2, 0.01, line_color='black')

+ fig.yaxis.axis_label = 'Tokens'
fig.xgrid.grid_line_color = None
fig.ygrid.grid_line_color = 'white'
fig.grid.grid_line_width = 2
@@ -197,7 +198,7 @@ def modeling():
document_labels = tokenized_corpus.index
document_term_matrix = preprocessing.create_document_term_matrix(tokenized_corpus, document_labels)
stats = pd.DataFrame({'score': np.array(document_term_matrix.sum(axis=1)),
- 'group': ['Corpus (raw)' for x in range(len(tokenized_corpus))]})
+ 'group': ['Document size (raw)' for x in range(len(tokenized_corpus))]})

if request.files.get('stopword_list', None):
log.info("Accessing external stopwords list ...")
@@ -215,7 +216,7 @@
features = [token for token in features if token in document_term_matrix.columns]
document_term_matrix = document_term_matrix.drop(features, axis=1)
stats = stats.append(pd.DataFrame({'score': np.array(document_term_matrix.sum(axis=1)),
- 'group': ['Corpus (cleaned)' for x in range(len(tokenized_corpus))]}))
+ 'group': ['Document size (cleaned)' for x in range(len(tokenized_corpus))]}))
parameter.append(int(document_term_matrix.values.sum()))
document_term_arr = document_term_matrix.as_matrix().astype(int)
log.info("Accessing corpus vocabulary ...")
@@ -240,15 +241,15 @@
log.info("Creating interactive heatmap ...")
if document_topics.shape[0] < document_topics.shape[1]:
if document_topics.shape[1] < 20:
- height = 20 * 25
+ height = 20 * 28
else:
- height = document_topics.shape[1] * 25
+ height = document_topics.shape[1] * 28
document_topics_heatmap = document_topics.T # todo: Fix hover when transposed
else:
if document_topics.shape[0] < 20:
- height = 20 * 25
+ height = 20 * 28
else:
- height = document_topics.shape[0] * 25
+ height = document_topics.shape[0] * 28
document_topics_heatmap = document_topics
fig = visualization.PlotDocumentTopics(document_topics_heatmap,
enable_notebook=False)
@@ -261,25 +262,25 @@

log.info("Creating interactive barcharts ...")
if document_topics.shape[1] < 10:
- height = 10 * 15
+ height = 10 * 18
else:
- height = document_topics.shape[1] * 15
+ height = document_topics.shape[1] * 18
topics_barchart = barchart(document_topics, height=height)
topics_script, topics_div = components(topics_barchart)

if document_topics.shape[0] < 10:
- height = 10 * 15
+ height = 10 * 18
else:
- height = document_topics.shape[1] * 15
+ height = document_topics.shape[0] * 18
documents_barchart = barchart(document_topics.T, height=height, topics=False)
documents_script, documents_div = components(documents_barchart)

js_resources = INLINE.render_js()
css_resources = INLINE.render_css()
end = time.time()
passed_time = round((end - start) / 60)
- index = ['Corpus size in documents', 'Corpus size in tokens', 'Corpus size in tokens (cleaned)',
-          'Size of vocabulary (cleaned)', 'Number of topics', 'Number of iterations', 'The model\'s log likelihood']
+ index = ['Corpus size in documents', 'Corpus size in tokens (raw)', 'Corpus size in tokens (clean)',
+          'Size of vocabulary (clean)', 'Number of topics', 'Number of iterations', 'The model\'s log likelihood']
if passed_time == 0:
index.append('Passed time in seconds')
parameter.append(round(end - start))
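
Several of the hunks above adjust the same sizing heuristic: each heatmap or barchart row gets a fixed number of pixels (raised from 25 to 28 for the heatmap, 15 to 18 for the barcharts), with a floor so very small corpora still render legibly. A small refactoring sketch of that pattern, not part of the commit:

    def plot_height(n_rows, row_px=28, min_rows=20):
        """Figure height in pixels: row_px per row, but never fewer than min_rows rows."""
        return max(n_rows, min_rows) * row_px

    # Equivalent to the branching in the diff, e.g. for the heatmap:
    #   height = 20 * 28 if document_topics.shape[1] < 20 else document_topics.shape[1] * 28
    # becomes plot_height(document_topics.shape[1]); the barcharts would use
    # plot_height(..., row_px=18, min_rows=10).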
