
Lemmatizer default dictionary missing in 2.0+ #505

Closed
CyborgDroid opened this issue May 13, 2019 · 2 comments

@CyborgDroid

The Lemmatizer is missing a default dictionary. This used to be available in versions before 2.0; was this intentional? If so, what default dictionary can I use?

lemmatizer = Lemmatizer() \
    .setInputCols(["token"]) \
    .setOutputCol("lemma")

error:

java.util.NoSuchElementException: Failed to find a default value for dictionary
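
(In 2.x the trainable Lemmatizer no longer bundles a dictionary, so one has to be supplied through setDictionary. A minimal sketch, assuming a local dictionary file lemmas_small.txt — a hypothetical name — whose lines map a lemma to its forms, with "->" as the key delimiter and tabs between forms:)

from sparknlp.annotator import Lemmatizer

# Each dictionary line looks like: lemma -> form1<TAB>form2<TAB>...
lemmatizer = Lemmatizer() \
    .setInputCols(["token"]) \
    .setOutputCol("lemma") \
    .setDictionary("lemmas_small.txt", "->", "\t")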

Using the lemmatizer via PretrainedPipeline also gives an error:

from sparknlp.pretrained import PretrainedPipeline
lemmatizer = PretrainedPipeline('lemma_antbnc', lang='en')
display(lemmatizer.annotate(stn_pipe_df, "TEXT"))

error:

IllegalArgumentException: 'requirement failed: Error loading metadata: Expected class name org.apache.spark.ml.PipelineModel but found class name com.johnsnowlabs.nlp.annotators.LemmatizerModel'
---------------------------------------------------------------------------
IllegalArgumentException                  Traceback (most recent call last)
<command-117803806228513> in <module>()
      1 from sparknlp.pretrained import PretrainedPipeline
      2 
----> 3 lemmatizer = PretrainedPipeline('lemma_antbnc', lang='en')
      4 
      5 display(lemmatizer.annotate(stn_pipe_df, "TEXT"))

/local_disk0/spark-0c207cac-7a16-4636-a36e-6e9aca9b2b3d/userFiles-31269e38-c76a-4744-aee5-9db5a76e1a16/addedFile450520371604220874JohnSnowLabs_spark_nlp_2_0_3-39cc3.jar/sparknlp/pretrained.py in __init__(self, name, lang, remote_loc)
     28 
     29     def __init__(self, name, lang='en', remote_loc=None):
---> 30         self.model = ResourceDownloader().downloadPipeline(name, lang, remote_loc)
     31         self.light_model = LightPipeline(self.model)
     32 

/local_disk0/spark-0c207cac-7a16-4636-a36e-6e9aca9b2b3d/userFiles-31269e38-c76a-4744-aee5-9db5a76e1a16/addedFile450520371604220874JohnSnowLabs_spark_nlp_2_0_3-39cc3.jar/sparknlp/pretrained.py in downloadPipeline(name, language, remote_loc)
     16     @staticmethod
     17     def downloadPipeline(name, language, remote_loc=None):

System info: Databricks with JohnSnowLabs:spark-nlp:2.0.3

@maziyarpanahi (Member) commented May 13, 2019

Hi @CyborgDroid,
lemma_antbnc is not a pipeline; it is actually a pre-trained LemmatizerModel with the dictionary included.
You can take a look here for more info about the available models and pipelines, and how to use them online/offline:
https://nlp.johnsnowlabs.com/docs/en/models
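
(If a ready-made pipeline is what you're after, PretrainedPipeline expects a pipeline name rather than a model name. A sketch using explain_document_ml, a pre-trained English pipeline that includes a lemma stage — assuming it is available for your Spark NLP version:)

from sparknlp.pretrained import PretrainedPipeline

pipeline = PretrainedPipeline('explain_document_ml', lang='en')

# annotate on a plain string returns a dict of stage outputs;
# the lemma stage's key is 'lemmas' in this pipeline
result = pipeline.annotate("Painting enhanced the imaginations of the children")
print(result['lemmas'])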

If you want a lemmatizer that is already trained with the default English dictionary, you can use the pre-trained LemmatizerModel:

val lemma = LemmatizerModel.pretrained("lemma_antbnc", lang="en")
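
(A Python equivalent, sketched as a minimal pipeline; the stn_pipe_df DataFrame and its "TEXT" column come from the report above, while the DocumentAssembler/Tokenizer stages assume a standard Spark NLP 2.x setup:)

from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer, LemmatizerModel

document = DocumentAssembler() \
    .setInputCol("TEXT") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

# Downloads the pre-trained English lemmatizer (dictionary included)
lemma = LemmatizerModel.pretrained("lemma_antbnc", lang="en") \
    .setInputCols(["token"]) \
    .setOutputCol("lemma")

pipeline = Pipeline(stages=[document, tokenizer, lemma])
result = pipeline.fit(stn_pipe_df).transform(stn_pipe_df)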

maziyarpanahi self-assigned this May 15, 2019
@CyborgDroid (Author)

Thank you, that worked!
