
Lemmatizer default dictionary missing in 2.0+ #505

Closed
CyborgDroid opened this issue May 13, 2019 · 2 comments

@CyborgDroid

The Lemmatizer is missing a default dictionary. This used to be available in versions before 2.0; was this intentional? If so, what default dictionary can I use?

lemmatizer = Lemmatizer() \
    .setInputCols(["token"]) \
    .setOutputCol("lemma")

error:

java.util.NoSuchElementException: Failed to find a default value for dictionary
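
(In 2.x the trainable Lemmatizer no longer bundles a dictionary, so one has to be supplied through setDictionary. A minimal sketch, assuming a local dictionary file lemmas_small.txt — a hypothetical name — whose lines map a lemma to its forms, with "->" as the key delimiter and tabs between forms:)

from sparknlp.annotator import Lemmatizer

# Each dictionary line looks like: lemma -> form1<TAB>form2<TAB>...
lemmatizer = Lemmatizer() \
    .setInputCols(["token"]) \
    .setOutputCol("lemma") \
    .setDictionary("lemmas_small.txt", "->", "\t")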

Using the lemmatizer via PretrainedPipeline also gives an error:

from sparknlp.pretrained import PretrainedPipeline
lemmatizer = PretrainedPipeline('lemma_antbnc', lang='en')
display(lemmatizer.annotate(stn_pipe_df, "TEXT"))

error:

IllegalArgumentException: 'requirement failed: Error loading metadata: Expected class name org.apache.spark.ml.PipelineModel but found class name com.johnsnowlabs.nlp.annotators.LemmatizerModel'
---------------------------------------------------------------------------
IllegalArgumentException                  Traceback (most recent call last)
<command-117803806228513> in <module>()
      1 from sparknlp.pretrained import PretrainedPipeline
      2 
----> 3 lemmatizer = PretrainedPipeline('lemma_antbnc', lang='en')
      4 
      5 display(lemmatizer.annotate(stn_pipe_df, "TEXT"))

/local_disk0/spark-0c207cac-7a16-4636-a36e-6e9aca9b2b3d/userFiles-31269e38-c76a-4744-aee5-9db5a76e1a16/addedFile450520371604220874JohnSnowLabs_spark_nlp_2_0_3-39cc3.jar/sparknlp/pretrained.py in __init__(self, name, lang, remote_loc)
     28 
     29     def __init__(self, name, lang='en', remote_loc=None):
---> 30         self.model = ResourceDownloader().downloadPipeline(name, lang, remote_loc)
     31         self.light_model = LightPipeline(self.model)
     32 

/local_disk0/spark-0c207cac-7a16-4636-a36e-6e9aca9b2b3d/userFiles-31269e38-c76a-4744-aee5-9db5a76e1a16/addedFile450520371604220874JohnSnowLabs_spark_nlp_2_0_3-39cc3.jar/sparknlp/pretrained.py in downloadPipeline(name, language, remote_loc)
     16     @staticmethod
     17     def downloadPipeline(name, language, remote_loc=None):

System info: Databricks with JohnSnowLabs:spark-nlp:2.0.3

@maziyarpanahi (Member) commented May 13, 2019

Hi @CyborgDroid,
lemma_antbnc is not a pipeline; it is actually a pre-trained LemmatizerModel with the dictionary included.
You can take a look here for more info about the available models and pipelines, and how to use them online/offline:
https://nlp.johnsnowlabs.com/docs/en/models
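
(If a ready-made pipeline is what you're after, PretrainedPipeline expects a pipeline name rather than a model name. A sketch using explain_document_ml, a pre-trained English pipeline that includes a lemma stage — assuming it is available for your Spark NLP version:)

from sparknlp.pretrained import PretrainedPipeline

pipeline = PretrainedPipeline('explain_document_ml', lang='en')

# annotate on a plain string returns a dict of stage outputs;
# the lemma stage's key is 'lemmas' in this pipeline
result = pipeline.annotate("Painting enhanced the imaginations of the children")
print(result['lemmas'])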

If you want a lemmatizer that is already trained with the default English dictionary, you can use the pre-trained LemmatizerModel:

val lemma = LemmatizerModel.pretrained("lemma_antbnc", lang="en")
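
(A Python equivalent, sketched as a minimal pipeline; the stn_pipe_df DataFrame and its "TEXT" column come from the report above, while the DocumentAssembler/Tokenizer stages assume a standard Spark NLP 2.x setup:)

from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer, LemmatizerModel

document = DocumentAssembler() \
    .setInputCol("TEXT") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

# Downloads the pre-trained English lemmatizer (dictionary included)
lemma = LemmatizerModel.pretrained("lemma_antbnc", lang="en") \
    .setInputCols(["token"]) \
    .setOutputCol("lemma")

pipeline = Pipeline(stages=[document, tokenizer, lemma])
result = pipeline.fit(stn_pipe_df).transform(stn_pipe_df)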

maziyarpanahi self-assigned this May 15, 2019
@CyborgDroid (Author)

Thank you, that worked!
