Commit a51363d
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
1 parent: 7d72efe
Showing 55 changed files with 267 additions and 192 deletions.
New file, 3 lines added:

```python
from application import web
from application import gui
from application import utils
```
4 binary files not shown.
New file, 22 lines added:

```python
#!/usr/bin/env python3

import pathlib
import sys

import flask


def create_app(**kwargs):
    """
    Creates a Flask app and determines the path for bokeh resources. If the
    scripts were frozen with PyInstaller, the paths are adjusted accordingly.
    """
    if getattr(sys, 'frozen', False):
        # PyInstaller unpacks bundled data files into a temporary directory
        # and stores its path in sys._MEIPASS.
        root = pathlib.Path(sys._MEIPASS)
        app = flask.Flask(import_name=__name__,
                          template_folder=str(pathlib.Path(root, 'templates')),
                          static_folder=str(pathlib.Path(root, 'static')),
                          **kwargs)
        bokeh_resources = str(pathlib.Path(root, 'static', 'bokeh_templates'))
    else:
        app = flask.Flask(import_name=__name__, **kwargs)
        bokeh_resources = str(pathlib.Path('static', 'bokeh_templates'))
    return app, bokeh_resources
```
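The frozen-vs.-unfrozen branch above can be exercised on its own. This is a minimal sketch of the same PyInstaller check; the plain `'.'` fallback root is an illustration (the real code lets Flask fall back to its default folders):

```python
import pathlib
import sys


def resource_root():
    # Frozen PyInstaller bundles set sys.frozen and unpack their data files
    # into a temporary directory exposed as sys._MEIPASS; a plain interpreter
    # has neither attribute, so getattr() defaults to False.
    if getattr(sys, 'frozen', False):
        return pathlib.Path(sys._MEIPASS)
    return pathlib.Path('.')


print(resource_root() / 'static' / 'bokeh_templates')
```

Run under a normal interpreter, this resolves resources relative to the working directory; inside a frozen bundle, the same call resolves them inside the unpacked bundle directory.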
New file, 56 lines added:

```python
#!/usr/bin/env python3

import sys
import pathlib

import PyQt5.QtGui
import PyQt5.QtWidgets
import PyQt5.QtWebEngineWidgets
import PyQt5.QtCore

import application.web


PORT = 5000
ROOT_URL = 'http://localhost:{port}'.format(port=PORT)


class FlaskThread(PyQt5.QtCore.QThread):
    """Runs the Flask application in a background thread so the Qt event
    loop in the main thread is not blocked."""

    def __init__(self, application):
        PyQt5.QtCore.QThread.__init__(self)
        self.application = application

    def __del__(self):
        self.wait()

    def run(self):
        self.application.run(port=PORT)


def provide_gui(application):
    """
    Opens a QtWebEngine window, runs the Flask application, and renders the
    index.html page.
    """
    title = 'Topics Explorer'
    icon = str(pathlib.Path('static', 'img', 'page_icon.png'))
    width = 1200
    height = 660

    qtapp = PyQt5.QtWidgets.QApplication(sys.argv)

    # Serve the Flask app in the background.
    webapp = FlaskThread(application)
    webapp.start()

    # Stop the server thread when the window is closed.
    qtapp.aboutToQuit.connect(webapp.terminate)

    webview = PyQt5.QtWebEngineWidgets.QWebEngineView()
    webview.resize(width, height)
    webview.setWindowTitle(title)
    webview.setWindowIcon(PyQt5.QtGui.QIcon(icon))

    webview.load(PyQt5.QtCore.QUrl(ROOT_URL))
    webview.show()
    return qtapp.exec_()


def run():
    # Expects application.web to expose the Flask app as a module-level `app`.
    sys.exit(provide_gui(application.web.app))
```
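`FlaskThread` exists so the blocking Flask server does not stall the Qt event loop. The same pattern can be sketched with only the standard library, with `http.server` standing in for Flask and a plain daemon thread standing in for `QThread`:

```python
import http.server
import threading
import urllib.request

# Run a blocking HTTP server in a background daemon thread, as FlaskThread
# does for the Flask app, leaving the main thread free for the GUI loop.
server = http.server.HTTPServer(('127.0.0.1', 0),
                                http.server.SimpleHTTPRequestHandler)
thread = threading.Thread(target=server.serve_forever, daemon=True)
thread.start()

# The main thread can now talk to the server, just as the QWebEngineView
# loads ROOT_URL from the Flask thread.
with urllib.request.urlopen('http://127.0.0.1:%d/' % server.server_port) as response:
    status = response.status

server.shutdown()
print(status)
```

Using port `0` asks the OS for a free ephemeral port; the real application instead fixes `PORT = 5000` so the web view knows which URL to load.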
40 files renamed without changes.
New file, 68 lines added:

```html
{% extends "layout.html" %}
{% block navigation %}
<ul class="nav pull-right">
  <li>
    <a href="{{ url_for('help') }}"><i class="icon-question-sign icon-white"></i> Help</a>
  </li>
</ul>
{% endblock %}
{% block content %}
<h1>Topics – Easy Topic Modeling</h1>
<div id="contentInner">
  <form action="/modeling" method="POST" enctype="multipart/form-data">
    <p>The text mining technique <b>Topic Modeling</b> has become a popular statistical method for clustering documents. This application introduces a user-friendly workflow comprising data preprocessing, the actual modeling using <b>latent Dirichlet allocation</b> (LDA), and various interactive visualizations to explore the model.</p>
    <p>LDA, introduced in the context of text analysis in 2003, is an instance of a more general class of models called <b>mixed-membership models</b>. It involves a number of distributions and parameters; inference is typically performed using <b>Gibbs sampling</b> with conjugate priors, and the model is based purely on word frequencies.</p>
    <div class="alert alert-block">
      <button type="button" class="close" data-dismiss="alert">×</button>
      <i class="fa fa-exclamation-circle"></i> This application is designed to introduce the technique in a gentle way and aims for simplicity. If you have a <b>very large corpus</b> (say, more than 200 documents with more than 5,000 words per document), you may wish to use more sophisticated models such as those implemented in <b>MALLET</b>, which is known to be more robust than standard LDA. Have a look at our Jupyter notebook introducing topic modeling with MALLET, available via <a href="https://github.com/DARIAH-DE/Topics">GitHub</a>.
    </div>
    <br>
    <h2>1. Preprocessing</h2>
    <p>A lot of information that is harmful, at least for LDA, hides in your raw text collection. This is why preprocessing is a crucial step in this workflow, and in <i>natural language processing</i> in general. First of all, your corpus will be <b>tokenized</b>. This is the process of splitting a text into individual words (so-called <i>tokens</i>). Token frequencies are typical units of analysis when working with text corpora. It may come as a surprise that reducing a book to a list of token frequencies retains useful information, but practice has shown this to be the case. Normally, the most frequent tokens of a document tend to be <b>semantically insignificant words</b> (like <i>the</i> or <i>and</i>, for instance). Because you are trying to uncover hidden semantic structures of a text collection, you have to get rid of those insignificant words before modeling. This is done during preprocessing.</p>
    <h3>1.1. Reading a Corpus of Documents</h3>
    <p>For this workflow, you will need a corpus (a set of texts) as plain text (<b>.txt</b>) or XML (<b>.xml</b>). TEI-encoded XML is fully supported; only the text part will be processed. Use the button below to select multiple text files. To gain better results, <b>choose at least five documents</b> (but the more the better).</p>
    <div class="alert alert-info">
      <button type="button" class="close" data-dismiss="alert">×</button>
      <b>Tip:</b> The <a href="https://textgridrep.org">TextGrid Repository</a> is a great place to start searching for text data. It is Open Access and provides a lot of literary texts in valid and well-formed TEI XML.
    </div>
    <input type="file" name="files" accept=".txt, .xml" multiple required/><br><br>
    <h3>1.2. Tokenization</h3>
    <p>An important preprocessing step is tokenization. Without identifying tokens, it is difficult to extract necessary information such as token frequencies in general, or <b>most frequent words</b>, also known as <i>stopwords</i>. In this application, one token consists of one or more characters, optionally followed by exactly one punctuation mark (such as a hyphen or apostrophe), followed by one or more characters. For example, the phrase “her father's arm-chair” will be tokenized as <code>["her", "father's", "arm-chair"]</code>.</p>
    <h3>1.3. Cleaning the Corpus</h3>
    <p>Stopwords are harmful for LDA and have to be removed from the corpus. If you want to <b>determine stopwords individually</b> based on your corpus, define a threshold for the most frequent words in the field below.</p>
    <div class="alert alert-info">
      <button type="button" class="close" data-dismiss="alert">×</button>
      <b>Tip:</b> Be careful with removing most frequent words – you might remove words that are quite important for LDA. To gain better results, it is highly recommended to use an <b>external stopwords list</b>. This application ships with stopword lists for English, German, Spanish, and French.
    </div>
    <input type="number" name="mfw_threshold" value="150" min="1">
    <p>Alternatively, upload your own list of tokens to remove here:</p>
    <input type="file" name="stopword_list"><br><br>
    <h2>2. Modeling</h2>
    <p>This workflow relies on an implementation by Allen Riddell, which is lightweight, fast, and provides basic LDA. You have to specify some <b>model parameters</b> in this section, first of all the number of topics. The best number depends on what you are looking for in the model. The default provides a <b>broad overview</b> of the contents of the corpus. The number of topics should also depend to some degree on the size of the text collection; 100 to 200 topics will produce reasonably <b>fine-grained results</b>.</p>
    <input type="number" name="num_topics" value="10" min="1" required>
    <p>LDA is an iterative sampling procedure. The number of sampling iterations is a <b>trade-off</b> between the time taken to complete sampling and the quality of the model. The default value produces quite good results, but feel free to increase the number of iterations.</p>
    <input type="number" name="num_iterations" value="200" min="10" required><br>
    <br>
    <h2>3. Visualizing</h2>
    <p>When using LDA to explore text collections, we are typically interested in examining texts in terms of their <b>constituent topics</b> (instead of word frequencies). Because the number of topics is so much smaller than the number of unique vocabulary elements (say, 10 versus 10,000), a range of data visualization methods becomes available. As you will see, all of the provided visualizations are <b>interactive</b>.</p>
    <br>
    <div class="center_button">
      <button class="button" type="submit"><b>Train<br>Topic Model</b></button>
    </div>
    <br>
  </form>
</div>
{% endblock %}
```
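The tokenization rule the template describes (one or more characters, optionally exactly one internal punctuation mark, then one or more characters) can be sketched as a regular expression. This pattern is an illustration, not necessarily the one the application actually uses:

```python
import re

# One or more word characters, optionally followed by exactly one internal
# apostrophe or hyphen plus one or more further word characters.
TOKEN = re.compile(r"\w+(?:['-]\w+)?")


def tokenize(text):
    """Split text into lowercase tokens, keeping internal ' and -."""
    return TOKEN.findall(text.lower())


print(tokenize("Her father's arm-chair"))
# ['her', "father's", 'arm-chair']
```

Because the punctuation mark must be followed by more word characters, trailing punctuation such as sentence-final periods is dropped rather than attached to the preceding token.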
3 files renamed without changes.