Description
The documentation does not provide a clear way to run the POSTagger. The annotator documentation gives the following snippet:
val posTagger = new PerceptronApproach()
.setInputCols(Array("sentence", "token"))
.setOutputCol("pos")
However, using this snippet results in a NullPointerException rather than running.
Expected Behavior
Adding this snippet to a reasonable workflow, such as the pipeline provided in the Quickstart documentation, should work without crashing.
Current Behavior
Adding the POSTagger to the pipeline results in a NullPointerException:
scala> pipeline.fit(data).transform(data).show()
java.lang.NullPointerException
at java.io.FilterInputStream.read(FilterInputStream.java:133)
at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
at java.io.InputStreamReader.read(InputStreamReader.java:184)
at java.io.BufferedReader.fill(BufferedReader.java:161)
at java.io.BufferedReader.readLine(BufferedReader.java:324)
at java.io.BufferedReader.readLine(BufferedReader.java:389)
at scala.io.BufferedSource$BufferedLineIterator.hasNext(BufferedSource.scala:72)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:389)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
at scala.collection.AbstractIterator.to(Iterator.scala:1336)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1336)
at com.johnsnowlabs.nlp.annotators.pos.perceptron.PerceptronApproach$.parsePOSCorpusFromDir(PerceptronApproach.scala:227)
at com.johnsnowlabs.nlp.annotators.pos.perceptron.PerceptronApproach$.retrievePOSCorpus(PerceptronApproach.scala:246)
at com.johnsnowlabs.nlp.annotators.pos.perceptron.PerceptronApproach.train(PerceptronApproach.scala:84)
at com.johnsnowlabs.nlp.annotators.pos.perceptron.PerceptronApproach.train(PerceptronApproach.scala:22)
at com.johnsnowlabs.nlp.AnnotatorApproach.fit(AnnotatorApproach.scala:28)
at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:153)
at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:149)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.IterableViewLike$Transformed$class.foreach(IterableViewLike.scala:44)
at scala.collection.SeqViewLike$AbstractTransformed.foreach(SeqViewLike.scala:37)
at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:149)
... 54 elided
Possible Solution
The documentation mentions a setCorpusPath config method. From a brief perusal of the code, it appears that setting the corpus path is required, since it has no default value. If that is the case, the documentation should explain how to set the corpus path and include a full example. Ideally, one would not need to specify a corpus at all, or the library would provide models pre-trained on various corpora.
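If setting the corpus path is indeed the missing step, a working invocation might look like the sketch below. The path is a placeholder and the expected corpus format is an assumption on my part, since neither is documented; this is based only on the stack trace, which shows the trainer trying to read corpus files during fit.

```scala
// Hypothetical fix: point the trainer at a POS-tagged corpus before fitting.
// "/path/to/pos-corpus" is a placeholder, and the required corpus format is
// an assumption; neither is spelled out in the documentation.
val posTagger = new PerceptronApproach()
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("pos")
  .setCorpusPath("/path/to/pos-corpus")
```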
Steps to Reproduce
Enter into the spark shell using
spark-shell --packages JohnSnowLabs:spark-nlp:1.2.2
and then run the following code
import com.johnsnowlabs.nlp._
import com.johnsnowlabs.nlp.annotators._
import com.johnsnowlabs.nlp.annotators.pos.perceptron.PerceptronApproach
import com.johnsnowlabs.nlp.annotators.sbd.pragmatic.SentenceDetectorModel
import org.apache.spark.ml.Pipeline
import spark.implicits._
import spark.sql
// Used my own data, adding the data from the notebook as an example
val data = spark.read.parquet("../sentiment.parquet").limit(1000)
val documentAssembler = new DocumentAssembler().setInputCol("text").setOutputCol("document")
val sentenceDetector = new SentenceDetectorModel().setInputCols(Array("document")).setOutputCol("sentence")
val regexTokenizer = new RegexTokenizer().setInputCols(Array("sentence")).setOutputCol("token")
val posTagger = new PerceptronApproach().setInputCols(Array("sentence", "token")).setOutputCol("pos")
val finisher = new Finisher().setInputCols("pos").setCleanAnnotations(false)
val pipeline = new Pipeline().setStages(Array(
documentAssembler,
sentenceDetector,
regexTokenizer,
posTagger,
finisher
))
pipeline.fit(data).transform(data).show()
Context
I was trying to pass data to what I assumed was a pre-trained model for use in an NLP pipeline.
Your Environment
Spark version: 2.1.1
spark-nlp version: 1.2.2
Running on Amazon EMR