Tool for extracting topics, keywords and their co-occurrence patterns (forming so-called frames) from a Dutch corpus.
The Frame Generator was created in collaboration with prof. dr. Joris van Eijnatten (UU) during his KB Digital Humanities Fellowship in 2016. Its aim is to meaningfully reduce a set of texts to word patterns that cut across the distributions generated by topic modelling, thus providing additional insight into the content of the data set.
- Generate topics with the Mallet or Gensim topic modelling library
- Extract a single, ranked list of keywords based on either topics or tf-idf scores
- Find co-occurrence patterns for the keywords in the texts from which they were originally extracted
- Optionally lemmatize and pos-tag the input texts with NLP suite Frog and restrict the keywords and collocates to specific part-of-speech tags
- Optionally correct OCR errors and spelling variations with user-provided lists of regular expressions
- Access and reuse the results of each processing stage as comma-separated values files
- Use from the command line, as a Python library or as a web application
- Python 2.7
- Mallet 2.0.7
Clone or download the GitHub repository:
$ git clone https://github.com/jlonij/frame-generator
Install the required Python packages:
$ cd frame-generator
$ pip install -r requirements.txt
Optionally, install Frog and the Frog wrapper (Frogger) locally.
Since Frog causes heavy load spikes on our infrastructure, availability cannot be guaranteed. The endpoint to which frame-generator will try to connect is a demo server, which might be down. If you want to analyze a lot of data, change the endpoint here:
Install Frog and its dependencies (see https://github.com/LanguageMachines/frog), then start it as a server on port 4096:
$ frog -S 4096
Place the directory frogger in your Apache2 www-root (/var/www/frogger/). It uses a .htaccess file to launch the application from Apache.
Test the wrapper without HTTP:
$ mv frogger /var/www/; cd /var/www/frogger; python frog.py
This should output some test text if all went well. Now try the wrapper over HTTP:
$ curl -s "http://localhost/frogger/?text=Dit%20is%20een%20test"
This should return some test text.
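The same HTTP check can be scripted from Python. The following is a minimal sketch, assuming the wrapper answers GET requests at http://localhost/frogger/ with a `text` query parameter (as in the curl call) and returns UTF-8 text. The helper names `frogger_request_url` and `frog` are illustrative and not part of Frogger or the Frame Generator.

```python
try:  # Python 3
    from urllib.parse import urlencode
    from urllib.request import urlopen
except ImportError:  # Python 2.7
    from urllib import urlencode
    from urllib2 import urlopen

# Assumed local endpoint, taken from the curl example.
FROGGER_URL = "http://localhost/frogger/"


def frogger_request_url(text, base_url=FROGGER_URL):
    """Build the request URL with the text parameter URL-encoded."""
    return base_url + "?" + urlencode({"text": text})


def frog(text, base_url=FROGGER_URL, timeout=30):
    """Send the text to the wrapper and return the raw response body.

    Decoding the response as UTF-8 is an assumption about the wrapper.
    """
    response = urlopen(frogger_request_url(text, base_url), timeout=timeout)
    try:
        return response.read().decode("utf-8")
    finally:
        response.close()

# Example (requires a running wrapper): print(frog("Dit is een test"))
```

Encoding the query string avoids problems with spaces and other special characters in the input text.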
Basic command line execution with the default values for all options:
$ cd frame-generator
$ ./generator.py
This will generate keywords and frames for the sample documents provided and print the results:
Keywords and frames generated:
(1) zotheid/N [0.153417875946]
prijzen/WW (1.81959197914), verkondigen/WW (0.778800783071), kwaden/ADJ (0.472366552741), gepast/ADJ (0.367879441171), staan/WW (0.28650479686), aanstonds/ADJ (0.28650479686)
(2) god/N [0.0681857686941]
goddelijk/ADJ (0.606530659713), stellen/WW (0.606530659713), vervroolijk/ADJ (0.472366552741), uitzonderen/WW (0.367879441171)
(3) mensch/N [0.0271176554179]
vervroolijk/ADJ (0.778800783071), toeschrijven/WW (0.778800783071), gewoonlijk/ADJ (0.606530659713), verjagen/WW (0.367879441171), goddelijk/ADJ (0.367879441171), spreken/WW (0.28650479686)
...
Command line interface
Input files are expected to be either utf-8 or iso-8859-1 encoded and have to be placed in the appropriate subdirectory of the input directory:
- docs contains plain text files with .txt extension of the documents to be processed.
- stop contains optional stop word lists to be applied when creating the vocabulary. The stop word lists should be plain text files with .txt extension in which each word occupies a single line.
- regex contains optional lists of regular expressions to be replaced in the input documents. These lists should have a .tsv extension and consist of two tab-separated columns, the first containing the regular expression and the second its intended replacement.
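To make the two-column format of the regular expression lists concrete, here is a minimal sketch of how such a list could be parsed and applied to a text. It is not the Frame Generator's own implementation. The function names and the two sample rules (a long-s OCR error and a spelling variant) are hypothetical.

```python
import io
import re


def parse_rules(lines):
    """Turn tab-separated lines into (compiled pattern, replacement) pairs."""
    rules = []
    for line in lines:
        parts = line.rstrip("\r\n").split("\t")
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        pattern, replacement = parts
        rules.append((re.compile(pattern), replacement))
    return rules


def load_rules(path, encoding="utf-8"):
    """Read a .tsv rule list from disk."""
    with io.open(path, encoding=encoding) as handle:
        return parse_rules(handle)


def apply_rules(text, rules):
    """Apply every rule in order; later rules see earlier replacements."""
    for pattern, replacement in rules:
        text = pattern.sub(replacement, text)
    return text


# Hypothetical rules, written as they would appear in a .tsv file.
sample_rules = parse_rules([
    "menfch\tmensch\n",
    "\\bende\\b\ten\n",
])
corrected = apply_rules("de menfch ende god", sample_rules)
```

Because the rules are applied in file order, a more specific pattern should precede a more general one that would otherwise match the same text.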
The Frame Generator command line interface accepts a number of options to control the process:
--gtype: the type of results to be generated. The user can choose between generating topics, keywords or frames. The default value is frames.
--dlen: the number of sentences the subdocuments used as units of analysis are to contain. Default value is 0, in which case the original, unsplit documents will be used.
--nopos: when this option is entered (no value required) the part-of-speech tagging functionality of the Frame Generator is turned off. This saves a lot of processing time and allows the generator to run offline.
--tcount: the number of topics to be generated. Default value is 10.
--tsize: the number of words to be contained in each topic. Default value is 10.
--mallet: full path to the Mallet executable; if not provided, Gensim's LDA implementation will be used to generate topics.
--kmodel: model to be used for scoring keywords, either lda or tf-idf. By default lda is used, meaning keywords are extracted on the basis of a topic model.
--kcount: the number of keywords to be generated. Default value is 10.
--ktags: the part-of-speech tags to be included in the keyword list separated by spaces, e.g. ADJ N WW.
--wdir: the direction, left or right, of the keyword in which frame words are searched for. When omitted, both directions are taken into account.
--wsize: the maximum word distance of a frame word to the keyword. Default value is 5.
--fsize: the maximum number of frame words to be generated with each keyword. Default value is 10.
--ftags: the part-of-speech tags to be included in the frame word list separated by spaces, e.g. ADJ N WW.
Values accepted as part-of-speech tags with the --ktags and --ftags options are the following main tags from the CGN tag set:
SPEC: Names and unknown
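As a rough illustration of how --wdir, --wsize, --fsize and --ftags interact, the sketch below collects frame word candidates around each occurrence of a keyword. It is not the Frame Generator's own code: the function name `frame_words` and the input format (a list of lemma and tag pairs) are assumptions, and plain co-occurrence counts stand in for the scores the tool reports.

```python
from collections import Counter


def frame_words(tokens, keyword, wsize=5, wdir=None, ftags=None, fsize=10):
    """Count words occurring within wsize positions of the keyword.

    tokens: list of (lemma, tag) pairs for one (sub)document.
    wdir:   'left', 'right' or None for both directions.
    ftags:  optional collection of part-of-speech tags to keep.
    """
    counts = Counter()
    for i, (lemma, _) in enumerate(tokens):
        if lemma != keyword:
            continue
        # Extend the window only in the requested direction(s).
        start = i - wsize if wdir in (None, "left") else i
        end = i + wsize if wdir in (None, "right") else i
        for j in range(max(0, start), min(len(tokens), end + 1)):
            if j == i:
                continue  # skip the keyword occurrence itself
            word, tag = tokens[j]
            if ftags and tag not in ftags:
                continue
            counts[(word, tag)] += 1
    # Keep at most fsize candidates, most frequent first.
    return counts.most_common(fsize)


sentence = [("de", "LID"), ("zotheid", "N"), ("prijzen", "WW"),
            ("en", "VG"), ("verkondigen", "WW")]
right_verbs = frame_words(sentence, "zotheid", wsize=2, wdir="right",
                          ftags={"WW"})
```

In this example only prijzen/WW is returned: verkondigen lies outside the two-word window and en is removed by the tag filter.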
Application programming interface
The Frame Generator can be used from another Python script or the Python interpreter by calling the generate() function in the generator module:
>>> import generator
>>> _, keyword_list, frame_list = generator.generate()
>>> keyword_list.print_keywords()
Keywords generated:
(1) zotheid/N [0.169311243818]
(2) god/N [0.0725594158983]
(3) mensch/N [0.034531287966]
...
The arguments of the function correspond to the command line options listed above, the function signature being:
generate(gtype='frames', dlen=0, pos=True, tcount=10, tsize=10, mallet=None, kmodel='lda', kcount=10, ktags=, wdir=None, wsize=5, fsize=10, ftags=, input_dir='input', output_dir='output')
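A call with non-default options might look like the sketch below. The keyword names are taken from the signature above; the chosen values are only examples, and the three-element return value is assumed to match the interpreter session shown earlier for gtype='frames'. The import is guarded so the snippet does nothing when run outside the frame-generator directory.

```python
# Example option values; the names follow the generate() signature.
options = {
    "gtype": "frames",  # same result type as the default
    "dlen": 5,          # split documents into 5-sentence subdocuments
    "pos": False,       # counterpart of --nopos: skip part-of-speech tagging
    "kcount": 20,       # ask for 20 keywords instead of 10
    "wsize": 5,         # frame words at most 5 words from the keyword
    "fsize": 10,        # at most 10 frame words per keyword
}

try:
    import generator  # available when run from the frame-generator directory
except ImportError:
    generator = None

if generator is not None:
    # Assumed to return three values, as in the session above.
    _, keyword_list, frame_list = generator.generate(**options)
    keyword_list.print_keywords()
```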
The Frame Generator can also be run as a simple Bottle web application accepting POST requests:
By default the service is started at
An online demo providing a graphical user interface to the Frame Generator’s main functionality and a basic visualization of the results is available at http://www.kbresearch.nl/frames/. The source code of the demo can be found here.