Unclear documentation on how to properly use the POSTagger #41

@rylanhalteman

Description

The documentation does not provide a clear way to run the POSTagger. The annotator documentation gives the following snippet:

val posTagger = new PerceptronApproach()
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("pos")

However, using this snippet results in a NullPointerException rather than running.

Expected Behavior

Adding this snippet to a reasonable workflow, such as the one provided in the Quickstart documentation, should add the POSTagger to the pipeline without crashing.

Current Behavior

Adding the POSTagger to the pipeline results in a NullPointerException:

scala> pipeline.fit(data).transform(data).show()
java.lang.NullPointerException
  at java.io.FilterInputStream.read(FilterInputStream.java:133)
  at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
  at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
  at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
  at java.io.InputStreamReader.read(InputStreamReader.java:184)
  at java.io.BufferedReader.fill(BufferedReader.java:161)
  at java.io.BufferedReader.readLine(BufferedReader.java:324)
  at java.io.BufferedReader.readLine(BufferedReader.java:389)
  at scala.io.BufferedSource$BufferedLineIterator.hasNext(BufferedSource.scala:72)
  at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:389)
  at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
  at scala.collection.Iterator$class.foreach(Iterator.scala:893)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
  at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
  at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
  at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
  at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
  at scala.collection.AbstractIterator.to(Iterator.scala:1336)
  at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
  at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336)
  at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
  at scala.collection.AbstractIterator.toArray(Iterator.scala:1336)
  at com.johnsnowlabs.nlp.annotators.pos.perceptron.PerceptronApproach$.parsePOSCorpusFromDir(PerceptronApproach.scala:227)
  at com.johnsnowlabs.nlp.annotators.pos.perceptron.PerceptronApproach$.retrievePOSCorpus(PerceptronApproach.scala:246)
  at com.johnsnowlabs.nlp.annotators.pos.perceptron.PerceptronApproach.train(PerceptronApproach.scala:84)
  at com.johnsnowlabs.nlp.annotators.pos.perceptron.PerceptronApproach.train(PerceptronApproach.scala:22)
  at com.johnsnowlabs.nlp.AnnotatorApproach.fit(AnnotatorApproach.scala:28)
  at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:153)
  at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:149)
  at scala.collection.Iterator$class.foreach(Iterator.scala:893)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
  at scala.collection.IterableViewLike$Transformed$class.foreach(IterableViewLike.scala:44)
  at scala.collection.SeqViewLike$AbstractTransformed.foreach(SeqViewLike.scala:37)
  at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:149)
  ... 54 elided

Possible Solution

The documentation mentions a setCorpusPath config method. From my brief perusal of the code, it appears that setting the corpus path is required, since it has no default value. If that is the case, the documentation should explain how to set the corpus path and include a full example. Ideally one would not need to specify a corpus at all, or the library would provide pre-trained models for various corpora.
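If so, presumably something along these lines is needed before fitting the pipeline (the corpus path below is a placeholder and I have not verified that this works or what format the corpus must be in):

val posTagger = new PerceptronApproach()
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("pos")
  .setCorpusPath("/path/to/pos/corpus")  // placeholder path; expected corpus format is undocumented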

Steps to Reproduce

Enter the Spark shell using

spark-shell  --packages JohnSnowLabs:spark-nlp:1.2.2

and then run the following code

import com.johnsnowlabs.nlp._
import com.johnsnowlabs.nlp.annotators._
import com.johnsnowlabs.nlp.annotators.pos.perceptron.PerceptronApproach
import com.johnsnowlabs.nlp.annotators.sbd.pragmatic.SentenceDetectorModel
import org.apache.spark.ml.Pipeline

import spark.implicits._
import spark.sql

// Used my own data; shown here with the data from the notebook as an example
val data = spark.read.parquet("../sentiment.parquet").limit(1000)

val documentAssembler = new DocumentAssembler().setInputCol("text").setOutputCol("document")


val sentenceDetector = new SentenceDetectorModel().setInputCols(Array("document")).setOutputCol("sentence")

val regexTokenizer = new RegexTokenizer().setInputCols(Array("sentence")).setOutputCol("token")

val posTagger = new PerceptronApproach().setInputCols(Array("sentence", "token")).setOutputCol("pos")

val finisher = new Finisher().setInputCols("pos").setCleanAnnotations(false)

val pipeline = new Pipeline().setStages(Array(
        documentAssembler,
        sentenceDetector,
        regexTokenizer,
        posTagger,
        finisher
    ))

pipeline.fit(data).transform(data).show()

Context

I was trying to pass data to what I assumed was a pre-trained model for use in an NLP pipeline.

Your Environment

Spark version: 2.1.1
spark-nlp version: 1.2.2
Running on Amazon EMR
