Graph extraction and NLP analysis for Baleen Corpora
Minke provides a command line script called sei that lets you interact with the Minke library and Baleen corpora. For example, to sample a corpus down to a smaller subset for testing or development, run:
$ ./sei sample path/to/corpus path/to/sample
You can describe a corpus using the describe command:
$ ./sei describe path/to/corpus
And you can preprocess a corpus into a pickled corpus:
$ ./sei preprocess path/to/html/corpus path/to/pickled/corpus
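The three commands above can be chained into one small workflow. The following is only a sketch under stated assumptions: the helper names run and sei_workflow, the SEI and DRY_RUN variables, and the idea that a sampled corpus can be passed straight to preprocess are illustrative and not part of Minke. Only the sample, describe, and preprocess invocations shown above are used.

```shell
#!/bin/sh
# Sketch: sample a corpus, describe the sample, then preprocess it.
# SEI, DRY_RUN, run, and sei_workflow are hypothetical names for illustration.
SEI="${SEI:-./sei}"          # path to the sei script (assumed to be ./sei)
DRY_RUN="${DRY_RUN:-1}"      # 1 = only print the commands, 0 = execute them

run() {
    # Print the command; execute it only when DRY_RUN is 0.
    printf '$ %s\n' "$*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

sei_workflow() {
    # $1 = source corpus, $2 = sample destination, $3 = pickled destination
    # Assumption: the sampled corpus is valid input for preprocess.
    run "$SEI" sample "$1" "$2" &&
    run "$SEI" describe "$2" &&
    run "$SEI" preprocess "$2" "$3"
}
```

Calling sei_workflow path/to/corpus path/to/sample path/to/pickled/corpus prints the three commands in order. Setting DRY_RUN=0 runs them, stopping at the first one that fails. Defaulting to a dry run is a deliberate choice here, so the sketch never touches a corpus unless asked to.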
Many more options and configurations are available; run ./sei --help for more information and refer to the conf/minke-example.conf configuration file.
The Baleen ingestion tool creates a corpus of web articles and blog posts from RSS feeds. Minke extends Baleen with a library for text analysis and graph extraction on the exported corpora.
Baleen, also called “whalebone”, refers to the straining plates that whales of the suborder Mysticeti have. These plates filter food from water, just as the Baleen ingestion engine filters content from the web. Minke whales are rorqual whales, among the shortest in fact. This library is so named to indicate that it is a short version of the larger Baleen codebase.