Releases: RTIInternational/gobbli

v0.2.4 - JOSS paper, dependency fixes

23 Jun 20:28
d9ec813

Repository now includes the paper submitted to JOSS.

Also fixed a few issues with dependency changes causing errors building some of the Docker containers.

v0.2.3 - Fix for interactive apps on Windows

01 Sep 12:48
2f4f6ab

Fix a bug when running interactive apps on Windows.

v0.2.2 - Classification report bug fix

21 Aug 19:56

Fixes a bug that was causing errors in the classification report printed by experiment results.

v0.2.1 - Docs fixes

27 May 14:10

Some minor fixes for the docs -- no functionality should be changed.

v0.2.0 - Python 3.8, multilabel classification, backtranslation

26 May 20:50

Major items in this release:

Python 3.8 support

We now run CI on both Python 3.7 and Python 3.8, making both versions officially supported. The only major change for the upgrade is that we now require the installed version of ray to be at least 0.8.4.

Multilabel classification

Most models now transparently support true multilabel classification, where the output layer of the model reports a predicted probability for each label rather than a single predicted class. Simply pass a List[List[str]] in place of a List[str] whenever you're setting labels, where each inner List[str] is the list of labels that apply to the document. The model will infer the set of all labels from your data and generate a predicted probability for each label on all new data. We also added a benchmark dataset for the multilabel classification case: the CMU Movie Summary dataset. The interactive apps should also work. Note that labels in CSV/TSV files are delimited by commas nested within the label column by default, but this can be changed using a command line argument.
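
To make the two label shapes concrete, here is a minimal sketch in plain Python. It does not call gobbli: `infer_label_set` and `to_indicator_rows` are hypothetical helpers written for this illustration, and the example labels are made up. The sketch only shows how a List[str] differs from a List[List[str]] and how a label set could be inferred from the data.

```python
from typing import List, Union

# Multiclass: exactly one label per document.
y_multiclass: List[str] = ["comedy", "drama", "comedy"]

# Multilabel: one list of labels per document (empty in this sketch
# means that no label applies to the document).
y_multilabel: List[List[str]] = [["comedy", "romance"], ["drama"], []]


def infer_label_set(y: Union[List[str], List[List[str]]]) -> List[str]:
    """Collect the sorted set of all labels seen in either label format."""
    labels = set()
    for item in y:
        if isinstance(item, str):
            labels.add(item)      # multiclass entry: a single label
        else:
            labels.update(item)   # multilabel entry: a list of labels
    return sorted(labels)


def to_indicator_rows(y: List[List[str]], labels: List[str]) -> List[List[int]]:
    """Encode each document as one 0/1 flag per label (multi-hot encoding)."""
    return [
        [1 if label in doc_labels else 0 for label in labels]
        for doc_labels in y
    ]


labels = infer_label_set(y_multilabel)
indicators = to_indicator_rows(y_multilabel, labels)
print(labels)
print(indicators)
```

Each row of `indicators` has one entry per inferred label, which mirrors the idea of a model emitting one predicted probability per label instead of a single class.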

Backtranslation

Implemented a new data augmentation approach based on transformers' implementation of the Marian Machine Translation model. Pass a list of target languages, and the model will translate each document from English to each language and back to generate a list of texts which are similar to, but not exactly the same as, the original.
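
The round trip described above can be sketched as follows. This is an illustration only and not gobbli's implementation: `backtranslate` and the `translate` callable signature are assumptions made for the example, and `toy_translate` is a stand-in that tags text instead of running a real translation model.

```python
from typing import Callable, List

# Hypothetical signature: translate(texts, source_lang, target_lang) -> texts
Translate = Callable[[List[str], str, str], List[str]]


def backtranslate(
    texts: List[str],
    target_languages: List[str],
    translate: Translate,
    source_language: str = "en",
) -> List[str]:
    """Translate every text to each target language and back again."""
    augmented: List[str] = []
    for lang in target_languages:
        forward = translate(texts, source_language, lang)           # en -> lang
        augmented.extend(translate(forward, lang, source_language))  # lang -> en
    return augmented


def toy_translate(texts: List[str], source_lang: str, target_lang: str) -> List[str]:
    """Stand-in translator: only appends the language pair to each text."""
    return [f"{text} [{source_lang}->{target_lang}]" for text in texts]


result = backtranslate(["hello world"], ["fr", "de"], toy_translate)
print(result)
```

With a real translation model in place of `toy_translate`, each input document would yield one paraphrase per target language.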

Miscellaneous improvements

  • Fix a potential error installing newer versions of sentencepiece (>=0.1.90).
  • Fix an error installing an older version of gensim (<3.8.2).
  • Fix errors running the interactive apps with tiny sample sizes (although you probably weren't trying to run them with 1 document... right?).
  • Fix some encoding errors reading data in the Transformer and SpaCyModel models.
  • Rework charting in benchmark output to prevent timeout errors during benchmarks.
  • Upgraded the version of transformers in the Transformer model to 2.8.0, allowing for use of the ELECTRA model.

v0.1.0 - New models, interactive apps, overhauled benchmarks

13 Mar 17:23

This is a large release including many new features:

New models

Implemented support for arbitrary scikit-learn models via SKLearnClassifier and TF-IDF as a baseline embedding approach via TfidfEmbedder. Implemented support for spaCy text categorizer models and spacy-transformers models via SpaCyModel. Upgraded pytorch_transformers v1.0.0 to transformers v2.4.1, which added support for several new models.

Interactive apps

gobbli now comes bundled with a few Streamlit apps that can be used to explore datasets, evaluate gobbli model performance, and generate local explanations for gobbli model predictions. See the docs for more information.

Overhauled benchmarks

Completely overhauled the benchmark framework. Benchmark output is now stored as Markdown files, which can much more easily be read on GitHub, and benchmarks can be selectively rerun when new models are added. Also developed a "benchmark" for embeddings, which plots the model embeddings in 2 dimensions and allows for a qualitative assessment of how well each model differentiates between the classes in the dataset. See the benchmark output folder.

Miscellaneous improvements

  • Add new BERT weights from NCBI trained on PubMed data (ncbi-bert-base-pubmed-uncased, ncbi-bert-base-pubmed-mimic-uncased, ncbi-bert-large-pubmed-uncased, ncbi-bert-large-pubmed-mimic-uncased) (thanks @pmbaumgartner!)
  • Upgrade fastText to a more recent version which supports autotuning parameters.
  • Add support for optional gradient accumulation in Transformer models, allowing for smaller batch sizes and larger models while retaining performance.
  • Upgrade USE implementation to the TensorFlow 2.0 version and add support for multilingual weights (universal-sentence-encoder-multilingual, universal-sentence-encoder-multilingual-large).
  • Add a couple of utilities for inspecting and cleaning up disk usage.
  • Fix memory issues with USE model by batching input data.
  • Fix potential encoding issues with non-ASCII text in USE model.
  • Reuse static pretrained weights across instances of models instead of redownloading every time.
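
For readers unfamiliar with the gradient accumulation item in the list above, the following is a minimal sketch of the general technique in plain Python. It is not taken from gobbli's Transformer model: the one-parameter linear model, the data, and both function names are assumptions chosen so that the arithmetic is easy to follow.

```python
from typing import List, Tuple

Example = Tuple[float, float]  # (x, y) pair


def mse_gradient(w: float, batch: List[Example]) -> float:
    """Gradient of the mean squared error of y ~ w * x with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_gradient(w: float, data: List[Example], micro_batch_size: int) -> float:
    """Accumulate scaled micro-batch gradients before a single weight update.

    Assumes len(data) is divisible by micro_batch_size so that every
    micro-batch carries equal weight.
    """
    micro_batches = [
        data[i:i + micro_batch_size]
        for i in range(0, len(data), micro_batch_size)
    ]
    steps = len(micro_batches)
    total = 0.0
    for micro_batch in micro_batches:
        # Divide by the number of accumulation steps so the sum matches
        # the gradient of one large batch.
        total += mse_gradient(w, micro_batch) / steps
    return total


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
w = 0.5
full = mse_gradient(w, data)
accumulated = accumulated_gradient(w, data, micro_batch_size=2)
print(full, accumulated)
```

Only one micro-batch needs to be held in memory at a time, while the resulting update matches that of the larger effective batch.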

v0.0.7 - One more PyPI upload fix

22 Oct 13:14

Adds requirements.txt (which is required for building some models) to the PyPI upload.

v0.0.6 - Warn about conflicting third-party libraries

25 Sep 14:05

Output a warning when gobbli is imported if certain third-party libraries are installed that shadow first-party modules having the same name.
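
One way such an import-time check could work is sketched below using only the standard library. This is an assumption about the general approach, not gobbli's actual code: `find_installed` and `warn_on_conflicts` are hypothetical names, and the module names passed in are placeholders.

```python
import importlib.util
import warnings
from typing import Iterable, List


def find_installed(module_names: Iterable[str]) -> List[str]:
    """Return the names that resolve to an importable top-level module."""
    return [
        name for name in module_names
        if importlib.util.find_spec(name) is not None
    ]


def warn_on_conflicts(module_names: Iterable[str]) -> List[str]:
    """Emit one UserWarning listing installed modules that may cause shadowing."""
    conflicts = find_installed(module_names)
    if conflicts:
        warnings.warn(
            "These installed libraries may shadow modules of the same name: "
            + ", ".join(conflicts),
            UserWarning,
        )
    return conflicts


# "json" is in the standard library, so it is always found; the second
# name is a placeholder that should not exist on any system.
print(find_installed(["json", "no_such_module_for_this_sketch"]))
```

Calling `warn_on_conflicts` from a package's `__init__.py` would surface the warning as soon as the package is imported.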

v0.0.5 - Additional PyPI upload fixes

19 Sep 13:06

Fix issues with Dockerfiles not being included in wheels, meta.json missing from source distribution, and some Docker-only Python files missing.

Bug fixes - missing Dockerfiles and bad error msg

18 Sep 20:22
  • Ensure Dockerfiles are uploaded to PyPI via MANIFEST.in (#9)
  • Better error message when models are run without building first (#8)