Minke

Graph extraction and NLP analysis for Baleen Corpora

Minke Whale

Quickstart

Minke provides a command-line script called sei that lets you interact with the Minke library and Baleen corpora. For example, to sample a smaller subset of a corpus for testing or development, run:

$ ./sei sample path/to/corpus path/to/sample
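
To make the idea of sampling concrete, here is a minimal Python sketch that copies a random subset of files from one directory tree to another. It is not how sei is implemented: the function name sample_corpus and the fraction and seed parameters are hypothetical, and the sketch assumes a corpus is simply a directory of files.

```python
import random
import shutil
from pathlib import Path


def sample_corpus(source, target, fraction=0.1, seed=None):
    """Copy a random subset of the files under source into target.

    Hypothetical helper, not part of Minke: it only illustrates sampling
    a corpus while keeping its directory layout.
    """
    source, target = Path(source), Path(target)
    rng = random.Random(seed)

    # Sort first so that a fixed seed gives the same sample on every run.
    documents = sorted(p for p in source.rglob("*") if p.is_file())

    # Keep at least one document when the corpus is not empty.
    count = 0
    if documents:
        count = min(len(documents), max(1, round(len(documents) * fraction)))

    copied = []
    for path in rng.sample(documents, count):
        destination = target / path.relative_to(source)
        destination.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(path, destination)
        copied.append(destination)
    return copied
```

Passing a seed makes the sample reproducible, which helps when tests depend on a fixed subset.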

You can describe corpora using the describe command as follows:

$ ./sei describe path/to/corpus
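
As a rough illustration of what describing a corpus might involve, the sketch below counts files per category and totals their size. It is not the output or implementation of sei describe: describe_corpus is a hypothetical name, and the sketch assumes each top-level subdirectory of the corpus is one category.

```python
from collections import Counter
from pathlib import Path


def describe_corpus(root):
    """Summarize a corpus directory: files per category and total size.

    Hypothetical helper, not Minke's implementation. It assumes each
    top-level subdirectory of root is one category.
    """
    root = Path(root)
    per_category = Counter()
    total_bytes = 0
    for path in root.rglob("*"):
        if not path.is_file():
            continue
        relative = path.relative_to(root)
        # Files directly under root have no category directory.
        category = relative.parts[0] if len(relative.parts) > 1 else "(root)"
        per_category[category] += 1
        total_bytes += path.stat().st_size
    return {
        "files": sum(per_category.values()),
        "categories": dict(per_category),
        "bytes": total_bytes,
    }
```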

You can also preprocess an HTML corpus into a pickled corpus:

$ ./sei preprocess path/to/html/corpus path/to/pickled/corpus
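
The general shape of such a preprocessing step can be sketched with the standard library alone. This is an assumption-laden illustration, not Minke's pipeline: the names TextExtractor and preprocess_corpus are hypothetical, the sketch assumes documents are UTF-8 .html files, and it pickles only a list of paragraph strings per document. The format sei actually writes may differ.

```python
import pickle
from html.parser import HTMLParser
from pathlib import Path


class TextExtractor(HTMLParser):
    """Collect the text of <p> elements (a deliberate simplification)."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._in_paragraph = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_paragraph = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_paragraph:
            text = "".join(self._buffer).strip()
            if text:
                self.paragraphs.append(text)
            self._in_paragraph = False

    def handle_data(self, data):
        if self._in_paragraph:
            self._buffer.append(data)


def preprocess_corpus(source, target):
    """Write one pickle of paragraph strings per HTML file under source."""
    source, target = Path(source), Path(target)
    written = []
    for path in sorted(source.rglob("*.html")):
        parser = TextExtractor()
        parser.feed(path.read_text(encoding="utf-8"))
        parser.close()
        # Mirror the source layout, swapping the file extension.
        destination = (target / path.relative_to(source)).with_suffix(".pickle")
        destination.parent.mkdir(parents=True, exist_ok=True)
        with destination.open("wb") as handle:
            pickle.dump(parser.paragraphs, handle, pickle.HIGHEST_PROTOCOL)
        written.append(destination)
    return written
```

Pickling the parsed result once means later analysis can load documents without re-parsing the HTML. Only unpickle files you created yourself, since loading an untrusted pickle can execute arbitrary code.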

Many more options and configurations are available; use ./sei --help for more information and refer to the conf/minke-example.conf configuration file.

About

The Baleen ingestion tool creates a corpus of web articles and blogs from RSS feeds. Minke extends Baleen with a library for text analysis and graph extraction on the exported corpora.

Baleen means “whale bone” and refers in particular to the straining plates found in whales of the suborder Mysticeti. These plates filter food from water, just as the Baleen ingestion engine filters content from the web. Minke whales are rorquals, and among the smallest of them. The library's name indicates that it is a short version of the larger Baleen codebase.

Throughput

Throughput Graph

Attribution

The image used in this README, "Minke whale 1" by Len2040, is licensed under CC BY-ND 2.0.