![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

# Spark-NLP and Spark ML Pipelines

## Simple Topic Modeling

`Spark-NLP`
* DocumentAssembler
* SentenceDetector
* Tokenizer
* Normalizer
* POS tagger
* Chunker
* Finisher

`Spark ML`
* CountVectorizer
* TF-IDF
* LDA

In [1]:
import sys

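from pyspark.ml import Pipeline
from pyspark.sql.functions import col
# Tokenizer is not imported from pyspark.ml.feature: the pipeline below uses Spark NLP's Tokenizer
from pyspark.ml.feature import CountVectorizer, HashingTF, IDF
from pyspark.ml.clustering import LDA, LDAModel

# Spark NLP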
import sparknlp
from sparknlp.pretrained import PretrainedPipeline
from sparknlp.annotator import *
from sparknlp.common import RegexRule
from sparknlp.base import *

### Let's create a Spark Session for our app

In [2]:
spark = sparknlp.start()

Let's download a sample of scientific abstracts from the PubMed dataset:
```
wget -N https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/pubmed/pubmed-sample.csv -P /tmp
```

In [3]:
! wget -N https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/pubmed/pubmed-sample.csv -P /tmp

--2019-03-23 20:02:11--  https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/pubmed/pubmed-sample.csv
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.18.11
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.18.11|:443... connected.
HTTP request sent, awaiting response... 304 Not Modified
File ‘/tmp/pubmed-sample.csv’ not modified on server. Omitting download.



In [4]:
pubMedDF = spark.read\
                .option("header", "true")\
                .csv("/tmp/pubmed-sample.csv")\
                .filter("AB IS NOT null")\
                .withColumn("text", col("AB"))\
                .drop("TI", "AB")

In [5]:
pubMedDF.printSchema()
pubMedDF.show()

root
 |-- text: string (nullable = true)

+--------------------+
|                text|
+--------------------+
|The human KCNJ9 (...|
|BACKGROUND: At pr...|
|OBJECTIVE: To inv...|
|Combined EEG/fMRI...|
|Kohlschutter synd...|
|Statistical analy...|
|The synthetic DOX...|
|Our objective was...|
|We conducted a ph...|
|"Monomeric sarcos...|
|We presented the ...|
|The literature de...|
|A novel approach ...|
|An HPLC-ESI-MS-MS...|
|The localizing an...|
|OBJECTIVE: To eva...|
|For the construct...|
|We report the res...|
|Intraparenchymal ...|
|It is known that ...|
+--------------------+
only showing top 20 rows



In [6]:
pubMedDF.count()

7537

### Let's create the Spark-NLP Pipeline

In [7]:
%%time
# Spark-NLP Pipeline

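# Wrap the raw "text" column into Spark NLP's document annotation
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Split each document into sentences
sentence_detector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

# Spark NLP's Tokenizer (not pyspark.ml.feature.Tokenizer)
tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

# Clean and lowercase the tokens
normalizer = Normalizer() \
    .setInputCols(["token"]) \
    .setOutputCol("normalized") \
    .setLowercase(True)

# Pretrained part-of-speech tagger
posTagger = PerceptronModel.pretrained() \
    .setInputCols(["sentence", "normalized"]) \
    .setOutputCol("pos")

# Group POS-tag sequences into chunks: proper-noun runs and simple noun phrases
chunker = Chunker() \
    .setInputCols(["sentence", "pos"]) \
    .setOutputCol("chunk") \
    .setRegexParsers(["<NNP>+", "<DT|PP\\$>?<JJ>*<NN>"])

# Convert chunk annotations into a plain array-of-strings column ("finished_chunk")
finisher = Finisher() \
    .setInputCols(["chunk"]) \
    .setIncludeMetadata(False)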

nlpPipeline = Pipeline(stages=[
    document_assembler, 
    sentence_detector, 
    tokenizer,
    normalizer,
    posTagger,
    chunker,
    finisher
])

CPU times: user 15.2 ms, sys: 5.51 ms, total: 20.7 ms
Wall time: 5.67 s

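The `Chunker` regex parsers above match sequences of POS tags rather than characters: `<DT|PP\$>?<JJ>*<NN>` reads as an optional determiner or possessive pronoun, any number of adjectives, then a singular noun. The plain-Python sketch below only illustrates that idea and is not how Spark NLP implements it. The helper names `tag_pattern_to_regex` and `chunk`, and the hand-tagged example sentence, are assumptions made for illustration.

```python
import re


def tag_pattern_to_regex(pattern):
    # Rewrite every "<A|B>" tag group as a regex group that matches exactly one tag.
    return re.sub(r"<([^>]+)>", r"(?:<(?:\1)>)", pattern)


def chunk(tagged_tokens, pattern):
    # Encode the sentence as one string of tags, e.g. "<DT><JJ><NN>",
    # remembering which character offset belongs to which token.
    tag_string = ""
    boundaries = {}  # character offset in tag_string -> token index
    for index, (_, tag) in enumerate(tagged_tokens):
        boundaries[len(tag_string)] = index
        tag_string += "<%s>" % tag
    boundaries[len(tag_string)] = len(tagged_tokens)

    chunks = []
    for match in re.finditer(tag_pattern_to_regex(pattern), tag_string):
        first, last = boundaries[match.start()], boundaries[match.end()]
        chunks.append(" ".join(word for word, _ in tagged_tokens[first:last]))
    return chunks


# Hypothetical, hand-tagged sentence (not output of the pretrained tagger)
tagged = [("the", "DT"), ("human", "JJ"), ("gene", "NN"),
          ("encodes", "VBZ"), ("a", "DT"), ("protein", "NN")]
noun_phrases = chunk(tagged, "<DT|PP\\$>?<JJ>*<NN>")
print(noun_phrases)  # ['the human gene', 'a protein']
```

The other parser, `<NNP>+`, returns nothing for this sentence because it contains no proper nouns.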

In [8]:
%%time
nlpPipelineDF = nlpPipeline.fit(pubMedDF).transform(pubMedDF)

CPU times: user 56.6 ms, sys: 17.4 ms, total: 74 ms
Wall time: 749 ms


### Let's create the Spark ML Pipeline

In [9]:
%%time
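# Spark ML Pipeline

# Chunk counts: keep at most 3000 terms that appear in at least 10 documents
cv = CountVectorizer(inputCol="finished_chunk", outputCol="features", vocabSize=3000, minDF=10.0)
idf = IDF(inputCol="features", outputCol="idf")
# LDA reads its default featuresCol ("features"), i.e. the raw counts rather than the "idf" column
lda = LDA(k=10, maxIter=1)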
mlPipeline = Pipeline(stages=[
    cv,
    idf,
    lda
])

CPU times: user 5.45 ms, sys: 3.45 ms, total: 8.89 ms
Wall time: 52 ms

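To make the `vocabSize` and `minDF` settings concrete, here is a minimal plain-Python sketch of how a count vectorizer can select its vocabulary: drop terms that occur in fewer than `min_df` documents, then keep the `vocab_size` most frequent survivors. It is an illustration under stated assumptions, not Spark's implementation: `build_vocab` and `to_counts` are hypothetical helpers, `min_df` is treated as an absolute document count, ties are broken alphabetically, and the three toy documents are invented.

```python
from collections import Counter


def build_vocab(docs, vocab_size, min_df):
    term_freq = Counter()  # total occurrences of each term in the corpus
    doc_freq = Counter()   # number of documents containing each term
    for doc in docs:
        term_freq.update(doc)
        doc_freq.update(set(doc))
    eligible = [t for t in term_freq if doc_freq[t] >= min_df]
    # Most frequent terms first; alphabetical tie-break is an assumption of this sketch.
    eligible.sort(key=lambda t: (-term_freq[t], t))
    return eligible[:vocab_size]


def to_counts(doc, vocab):
    # Sparse representation: vocabulary index -> count, ignoring out-of-vocabulary terms.
    index = {term: i for i, term in enumerate(vocab)}
    counts = Counter(index[t] for t in doc if t in index)
    return dict(sorted(counts.items()))


# Invented toy corpus of chunk lists
docs = [["the patient", "a study", "the patient"],
        ["a study", "the results"],
        ["the patient", "a study"]]
vocab = build_vocab(docs, vocab_size=3000, min_df=2)
print(vocab)                      # ['a study', 'the patient']
print(to_counts(docs[0], vocab))  # {0: 1, 1: 2}
```

"the results" is dropped because it appears in only one document, which is the effect `minDF=10.0` has on rare chunks in the PubMed sample.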

### Train the Spark ML Pipeline on the output of the Spark-NLP Pipeline

In [10]:
%%time
mlModel = mlPipeline.fit(nlpPipelineDF)

In [11]:
%%time
mlPipelineDF = mlModel.transform(nlpPipelineDF)

CPU times: user 15 ms, sys: 7.55 ms, total: 22.5 ms
Wall time: 116 ms

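The `idf` column produced here holds the term counts rescaled by inverse document frequency. Spark's documentation gives the weight as log((m + 1) / (df + 1)) for a corpus of m documents and a term found in df of them. The sketch below evaluates that formula in plain Python. The function name `spark_style_idf` and the example document frequencies are assumptions, while 7537 is the row count reported by `pubMedDF.count()`.

```python
import math


def spark_style_idf(num_docs, doc_freq):
    # Smoothed inverse document frequency: log((m + 1) / (df + 1))
    return math.log((num_docs + 1) / (doc_freq + 1))


num_docs = 7537  # from pubMedDF.count()

# Hypothetical document frequencies, from rare to present everywhere
for doc_freq in (10, 500, 7537):
    weight = spark_style_idf(num_docs, doc_freq)
    print(doc_freq, weight)
```

A chunk that occurs in every document gets weight 0, and the weight grows as the chunk becomes rarer.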

In [12]:
mlPipelineDF.show()

+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|      finished_chunk|            features|                 idf|   topicDistribution|
+--------------------+--------------------+--------------------+--------------------+--------------------+
|The human KCNJ9 (...|[The human KCNJ, ...|(3000,[6,12,28,72...|(3000,[6,12,28,72...|[0.00368146035207...|
|BACKGROUND: At pr...|[BACKGROUND, the ...|(3000,[0,4,14,15,...|(3000,[0,4,14,15,...|[0.44577090261779...|
|OBJECTIVE: To inv...|[OBJECTIVE, the r...|(3000,[7,19,44,56...|(3000,[7,19,44,56...|[0.00310604411566...|
|Combined EEG/fMRI...|[EEG/fMR, recordi...|(3000,[13,24,46,1...|(3000,[13,24,46,1...|[0.00473349830410...|
|Kohlschutter synd...|[Kohlschutter, sy...|(3000,[110,645,90...|(3000,[110,645,90...|[0.02488645895143...|
|Statistical analy...|[Statistical anal...|(3000,[9,32,60,19...|(3000,[9,32,60,19...|[0.00764828599675...|
|The synthetic DOX...|[The synthetic 

In [13]:
%%time
ldaModel = mlModel.stages[2]

ll = ldaModel.logLikelihood(mlPipelineDF)
lp = ldaModel.logPerplexity(mlPipelineDF)
print("The lower bound on the log likelihood of the entire corpus: " + str(ll))
print("The upper bound on perplexity: " + str(lp))


The lower bound on the log likelihood of the entire corpus: -1427616.1489268562
The upper bound on perplexity: 8.376505148282039
CPU times: user 44.1 ms, sys: 38.4 ms, total: 82.5 ms
Wall time: 6min 22s

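The two numbers above are related: to my understanding, Spark computes the log-perplexity bound as the negative log-likelihood bound divided by the number of tokens in the corpus. Assuming that relationship holds, the short check below recovers the implied token count from the printed values. The variable names are illustrative only.

```python
import math

# Values copied from the cell output above
log_likelihood = -1427616.1489268562
log_perplexity = 8.376505148282039

# Assumption: logPerplexity = -logLikelihood / corpus token count
implied_tokens = -log_likelihood / log_perplexity
print(round(implied_tokens))  # 170431

# Perplexity itself is the exponential of the per-token bound
perplexity = math.exp(log_perplexity)
print(perplexity)
```

Lower perplexity means a better fit. With `maxIter=1` the model has barely been trained, so these values are a baseline rather than a tuned result.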

In [14]:
%%time
# Describe topics.
topics = ldaModel.describeTopics(3)
print("The topics described by their top-weighted terms:")
topics.show(truncate=False)


The topics described by their top-weighted terms:
+-----+-------------+---------------------------------------------------------------------+
|topic|termIndices  |termWeights                                                          |
+-----+-------------+---------------------------------------------------------------------+
|0    |[0, 2, 5]    |[0.013919034901292178, 0.006700064793232536, 0.003015485586985393]   |
|1    |[1, 2, 77]   |[0.0034154322431057243, 0.0027432705547898095, 0.0020877414099877895]|
|2    |[108, 25, 8] |[0.004633856073583384, 0.0023776985723297345, 0.002026254388192705]  |
|3    |[7, 6, 45]   |[0.003100067737830986, 0.003057422598986175, 0.002382902473412686]   |
|4    |[59, 14, 211]|[0.0024608783220347666, 0.0020297154176113497, 0.0018962959098669185]|
|5    |[111, 36, 0] |[0.0020101336462628257, 0.0019800534802662255, 0.001965065438641306] |
|6    |[22, 46, 6]  |[0.002519452910460417, 0.0023172448672150424, 0.0022960831624377676] |
|7    |[1, 7, 3]    |[0.005489

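`describeTopics` reports vocabulary indices, not the chunks themselves. A minimal sketch of mapping them back follows. The helper `label_topics` and the placeholder vocabulary are hypothetical. In the notebook the vocabulary would come from the fitted `CountVectorizerModel` (`mlModel.stages[0].vocabulary`) and the rows from `topics.collect()`.

```python
def label_topics(vocabulary, topic_rows, digits=4):
    # Replace each term index by its vocabulary entry, paired with a rounded weight.
    labeled = {}
    for topic, term_indices, term_weights in topic_rows:
        labeled[topic] = [(vocabulary[i], round(w, digits))
                          for i, w in zip(term_indices, term_weights)]
    return labeled


# Placeholder vocabulary (assumption); in the notebook use:
#   vocabulary = mlModel.stages[0].vocabulary
vocabulary = ["term_%d" % i for i in range(3000)]

# Topic 0 as printed in the table above; in the notebook use:
#   topic_rows = [(r.topic, r.termIndices, r.termWeights) for r in topics.collect()]
topic_rows = [(0, [0, 2, 5],
               [0.013919034901292178, 0.006700064793232536, 0.003015485586985393])]

labeled = label_topics(vocabulary, topic_rows)
print(labeled[0])  # [('term_0', 0.0139), ('term_2', 0.0067), ('term_5', 0.003)]
```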
In [15]:
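# Output topics. Each is a distribution over words (matching word count vectors)
print("Learned topics (as distributions over vocab of " + str(ldaModel.vocabSize())
      + " words):")

# topicsMatrix() returns a single vocabSize x k matrix; column j holds the term weights of topic j.
topics = ldaModel.topicsMatrix()

# Note: str(topics) is the full matrix, so every iteration prints the same values.
# Use topics.toArray()[:, topic] to print only one topic's column.
for topic in range(3):
    print("Topic " + str(topic) + ":" + str(topics))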

Learned topics (as distributions over vocab of 3000 words):
Topic 0:DenseMatrix([[50.73783652,  5.04405936,  0.78578643, ...,  3.81858649,
               8.53652028,  0.69201923],
             [ 9.76344508, 11.50356713,  2.16274875, ..., 18.54846812,
               4.93117058,  5.15730655],
             [24.42315825,  9.23964955,  1.35829732, ...,  0.92189725,
               3.79467484,  2.62953158],
             ...,
             [ 0.94607864,  0.90747457,  1.16281851, ...,  0.9498116 ,
               0.8931674 ,  1.03045281],
             [ 0.91667884,  0.99775866,  0.83649044, ...,  0.97774381,
               0.93629883,  0.91163679],
             [ 1.07634159,  1.45076257,  0.86056615, ...,  1.01649731,
               1.1209169 ,  1.02465341]])
Topic 1:DenseMatrix([[50.73783652,  5.04405936,  0.78578643, ...,  3.81858649,
               8.53652028,  0.69201923],
             [ 9.76344508, 11.50356713,  2.16274875, ..., 18.54846812,
               4.93117058,  5.15730655],
         