Command-line tool to extract a ranked list of relevant keywords from a corpus with the option of using either topic modeling or tf-idf scores.

Keyword Generator

The Keyword Generator, created in collaboration with KB Researcher-in-residence Pim Huijnen, is a command-line tool that offers two methods to extract relevant keywords from a collection of sample texts provided by the user:

  1. keywords_tfidf.py, extracting keywords based on tf-idf scores. Options:
  • -k : number of keywords to be generated (default 10)
  • -d : document length (the documents provided by the user will be split into parts containing the specified number of words; by default the documents will not be split.)
  2. keywords_lda.py, extracting keywords based on either Gensim's or Mallet's implementation of LDA topic modeling. Options:
  • -t : number of topics (default 10)
  • -w : number of words per topic (default 10)
  • -k : number of keywords (default 10)
  • -d : document length (the documents provided by the user will be split into parts containing the specified number of words; by default the documents will not be split.)
  • -m : Mallet path (full path to the Mallet executable; if not provided, Gensim's LDA implementation will be used to generate topics.)
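
To illustrate what the -d and -k options of the tf-idf method do, here is a minimal sketch written as standalone Python 3. It is not taken from keywords_tfidf.py: the function names, the whitespace tokenization, the idf formula log(N / df) and the choice to sum scores over all document parts are assumptions made for the example, and the tool itself may tokenize and score differently.

```python
import math
from collections import Counter


def split_into_parts(words, part_length=None):
    """Split a list of words into consecutive parts of part_length words.

    With no part_length the words are returned as a single part,
    mirroring the "documents are not split by default" behaviour of -d.
    """
    if not part_length:
        return [words]
    return [words[i:i + part_length] for i in range(0, len(words), part_length)]


def rank_keywords(documents, num_keywords=10, part_length=None, stop_words=()):
    """Return the num_keywords words with the highest summed tf-idf score.

    Assumption: scores are summed over all document parts; this is one
    possible aggregation, not necessarily the one the tool uses.
    """
    stop_words = set(stop_words)
    parts = []
    for text in documents:
        # Naive tokenization: lowercase and split on whitespace.
        words = [w for w in text.lower().split() if w not in stop_words]
        parts.extend(p for p in split_into_parts(words, part_length) if p)

    # Document frequency: the number of parts in which each word occurs.
    doc_freq = Counter()
    for part in parts:
        doc_freq.update(set(part))

    scores = Counter()
    for part in parts:
        for word, count in Counter(part).items():
            tf = count / len(part)
            idf = math.log(len(parts) / doc_freq[word])
            scores[word] += tf * idf

    # Highest score first; ties are broken alphabetically.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:num_keywords]]
```

Calling rank_keywords(texts, num_keywords=20, part_length=100) in this sketch corresponds roughly to running the tf-idf script with -k 20 -d 100. Note that with this idf formula a word occurring in every part scores zero, which is why splitting long documents into parts can change the ranking.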

Place documents in the data/documents folder and stop word lists in the data/stop_words folder. The keyword lists, along with any topics and topic distributions generated, are saved in the data/results folder.
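
The folder layout described above could be read along the following lines. This is a hedged sketch in standalone Python 3, not the code in corpus.py: the helper names read_folder and load_inputs are hypothetical, and UTF-8 encoding and one stop word per line are assumptions about the input files.

```python
import os


def read_folder(folder):
    """Return the text of every regular file in folder, keyed by file name."""
    texts = {}
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        if os.path.isfile(path):
            # Assumption: input files are UTF-8 encoded plain text.
            with open(path, encoding="utf-8") as handle:
                texts[name] = handle.read()
    return texts


def load_inputs(data_dir="data"):
    """Load documents and stop words from the layout described in the README."""
    documents = read_folder(os.path.join(data_dir, "documents"))

    stop_words = set()
    stop_word_files = read_folder(os.path.join(data_dir, "stop_words"))
    for text in stop_word_files.values():
        # Assumption: each stop word list holds one word per line.
        stop_words.update(
            line.strip().lower() for line in text.splitlines() if line.strip()
        )
    return documents, stop_words
```

All stop word lists found in the folder are merged into a single set here, so several lists (for example one per language) can be combined simply by placing them side by side.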

The Keyword Generator currently requires Python 2.7. Gensim must be installed locally, as must Mallet if its LDA implementation is to be used.

Some examples of commands:

$ ./keywords_tfidf.py
$ ./keywords_tfidf.py -k 20 -d 100
$ ./keywords_lda.py -k 10 -d 100 -t 5 -w 20
$ ./keywords_lda.py -k 10 -d 100 -t 5 -w 20 -m /opt/mallet-2.0.7/bin/mallet
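
For readers who want to see how the keywords_lda.py options fit together, the sketch below parses them with argparse in standalone Python 3. The flags and defaults follow the option list in this README; the destination names (topics, words, keywords, doc_length, mallet_path) and the choose_backend helper are hypothetical and are not taken from the script itself.

```python
import argparse


def build_parser():
    """Build a parser for the options documented for keywords_lda.py."""
    parser = argparse.ArgumentParser(
        description="Extract keywords with LDA topic modeling (illustrative sketch)."
    )
    parser.add_argument("-t", dest="topics", type=int, default=10,
                        help="number of topics")
    parser.add_argument("-w", dest="words", type=int, default=10,
                        help="number of words per topic")
    parser.add_argument("-k", dest="keywords", type=int, default=10,
                        help="number of keywords")
    parser.add_argument("-d", dest="doc_length", type=int, default=None,
                        help="split documents into parts of this many words")
    parser.add_argument("-m", dest="mallet_path", default=None,
                        help="full path to the Mallet executable")
    return parser


def choose_backend(args):
    """Use Mallet when a path is given, otherwise fall back to Gensim."""
    return "mallet" if args.mallet_path else "gensim"
```

Passing the arguments of the last example command to build_parser().parse_args(...) yields 5 topics, 20 words per topic, 10 keywords and a document length of 100, and choose_backend then selects Mallet because -m is set. Without -m, the same call selects Gensim.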