John Snow Labs Spark-NLP 2.2.0: BERT improvements, OCR Coordinates, python evaluation

@saif-ellafi saif-ellafi released this 23 Aug 06:06

Following a release candidate schedule last time proved to be quite an effective way to avoid silly bugs right after release!
Fortunately, careful testing of releases alongside the community turned up no breaking bugs and resulted in various pull requests.
This huge release features OCR-based coordinate highlighting, a BERT embeddings refactor and tuning, more tools for accuracy evaluation in Python, and much more.
We welcome your feedback in our Slack channels, as always!


New Features

  • OCRHelper now returns a coordinate positions matrix for text converted from PDF
  • New annotator PositionFinder consumes OCRHelper positions to return rectangle coordinates for CHUNK annotator types
  • Evaluation module now also ported to Python
  • WordEmbeddings now includes coverage metadata, and the new static functions withCoverageColumn and overallCoverage offer metric analysis
  • NerDL now has an includeConfidence param that enables confidence scores in prediction metadata
  • NerDLApproach now has an enableOutputLog param that outputs training metric logs to a file
  • New param poolingLayer in BERT allows for pooling layer selection
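
To make the coverage idea concrete, here is a minimal plain-Python sketch of one way such a metric could be computed: the share of tokens that are found in an embeddings vocabulary. This is an illustration only, not Spark NLP code. The function name `coverage`, its return keys, and the sample vocabulary are all hypothetical, and the library's withCoverageColumn and overallCoverage may define or report the metric differently.

```python
from typing import Dict, Iterable, Set


def coverage(tokens: Iterable[str], vocabulary: Set[str]) -> Dict[str, float]:
    """Count how many tokens appear in the vocabulary (hypothetical helper)."""
    token_list = list(tokens)
    total = len(token_list)
    covered = sum(1 for token in token_list if token in vocabulary)
    # Guard against empty input so the ratio is always defined.
    percentage = covered / total if total else 0.0
    return {"covered": covered, "total": total, "percentage": percentage}


# Assumed toy vocabulary standing in for the words an embeddings file contains.
vocab = {"the", "cat", "sat"}
result = coverage(["the", "cat", "sat", "quietly"], vocab)
print(result)
```

A low ratio on a real corpus would suggest that many tokens fall outside the embeddings vocabulary, which is the kind of signal a coverage metric is meant to surface.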

Enhancements

  • BERT Embeddings now merges much better with Spark NLP, returning state of the art accuracy numbers for NER (Details will be expanded). Thank you for community feedback.
  • Progress bar and size estimate report when downloading pretrained models and loading embeddings
  • Models and pipeline cache now more efficiently managed and includes CRC (not retroactive)
  • Finisher and LightPipeline now deal with embeddings properly, including them in the preprocessed result (Thank you Will Held)
  • Tokenizer now allows regular expressions in the list of Exceptions (Thank you @atomobianco)
  • PretrainedPipelines now offer the function fullAnnotate to retrieve the full information of Annotations
  • DocumentAssembler has new cleanup modes: each, each_full and delete_full allow more control over text cleanup (different ways of dealing with new lines and tabs)
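
As a rough illustration of what regular-expression exceptions in a tokenizer mean, the plain-Python sketch below keeps any span matching an exception pattern as a single token and splits everything else into words and punctuation marks. It is not the Spark NLP Tokenizer and does not use its API. The `tokenize` function, the splitting rules, and the email pattern are assumptions chosen for the example.

```python
import re
from typing import List


def tokenize(text: str, exceptions: List[str]) -> List[str]:
    """Split text into tokens, keeping exception matches whole (hypothetical)."""
    # Exception patterns come first so they win over the generic rules:
    # a run of word characters, or a single punctuation character.
    alternatives = [f"(?:{pattern})" for pattern in exceptions]
    alternatives += [r"\w+", r"[^\w\s]"]
    combined = re.compile("|".join(alternatives))
    # finditer with group(0) returns whole matches even if an exception
    # pattern contains its own capturing groups.
    return [match.group(0) for match in combined.finditer(text)]


# Assumed exception: a simple email-like pattern that should stay in one piece.
email_pattern = r"[\w.]+@[\w.]+\.\w+"
tokens = tokenize("Email me at a.b@example.com, please.", [email_pattern])
print(tokens)
```

Without the exception pattern, the same text would be broken apart at the dots and the at sign, which is the behavior a regular-expression exception is meant to prevent.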

Bugfixes

  • Fixed a bug in NerConverter caused by empty entities, returning an error when flushing entities
  • Fixed a bug when creating BERT models from Python, where contrib libraries were not loaded
  • Fixed missing setters for whitelist param in NerConverter
  • Fixed a bug where parameters from a BERT model were read incorrectly from Python because they were not serialized correctly
  • Fixed a bug where ResourceDownloader conflicted S3 credentials with public model access (Thank you Dimitris Manikis)
  • Fixed Context Spell Checker bugs with performance improvements (pretrained model disabled until we get a better one)