![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

# Spark-NLP and Spark ML Pipelines

## Simple Topic Modeling

`Spark-NLP`
* Tokenizing
* Normalizing
* POS tagging
* Chunking

`Spark ML`
* Count vectorizing
* TF-IDF
* LDA

In [17]:
import sys

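from pyspark.sql.functions import col
from pyspark.ml import Pipeline
# Tokenizer is not imported from pyspark.ml.feature: the wildcard import from
# sparknlp.annotator below provides the Tokenizer used in this notebook.
from pyspark.ml.feature import CountVectorizer, HashingTF, IDF
from pyspark.ml.clustering import LDA, LDAModel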

# Spark NLP
import sparknlp
from sparknlp.pretrained import PretrainedPipeline
from sparknlp.annotator import *
from sparknlp.common import RegexRule
from sparknlp.base import *

### Let's create a Spark Session for our app

In [2]:
spark = sparknlp.start()

Let's download a sample of scientific text from the PubMed dataset:
```
wget -N https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/pubmed/pubmed-sample.csv -P /tmp
```

In [3]:
! wget -N https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/pubmed/pubmed-sample.csv -P /tmp

--2019-03-23 11:36:51--  https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/pubmed/pubmed-sample.csv
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.168.221
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.168.221|:443... connected.
HTTP request sent, awaiting response... 304 Not Modified
File ‘/tmp/pubmed-sample.csv’ not modified on server. Omitting download.



In [4]:
pubMedDF = spark.read\
                .option("header", "true")\
                .csv("/tmp/pubmed-sample.csv")\
                .filter("AB IS NOT null")\
                .withColumn("text", col("AB"))\
                .drop("TI", "AB")

In [5]:
pubMedDF.printSchema()
pubMedDF.show()

root
 |-- text: string (nullable = true)

+--------------------+
|                text|
+--------------------+
|The human KCNJ9 (...|
|BACKGROUND: At pr...|
|OBJECTIVE: To inv...|
|Combined EEG/fMRI...|
|Kohlschutter synd...|
|Statistical analy...|
|The synthetic DOX...|
|Our objective was...|
|We conducted a ph...|
|"Monomeric sarcos...|
|We presented the ...|
|The literature de...|
|A novel approach ...|
|An HPLC-ESI-MS-MS...|
|The localizing an...|
|OBJECTIVE: To eva...|
|For the construct...|
|We report the res...|
|Intraparenchymal ...|
|It is known that ...|
+--------------------+
only showing top 20 rows



In [6]:
pubMedDF.count()

7537

### Let's create the Spark-NLP Pipeline

In [7]:
%%time
# Spark-NLP Pipeline

document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence_detector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

normalizer = Normalizer() \
  .setInputCols(["token"]) \
  .setOutputCol("normalized") \
  .setLowercase(True)

posTagger = PerceptronModel.pretrained() \
  .setInputCols(["sentence", "normalized"]) \
  .setOutputCol("pos")

chunker = Chunker() \
    .setInputCols(["sentence", "pos"]) \
    .setOutputCol("chunk") \
    .setRegexParsers(["<NNP>+", "<DT|PP\\$>?<JJ>*<NN>"])

finisher = Finisher() \
  .setInputCols(["chunk"]) \
  .setIncludeMetadata(False)

nlpPipeline = Pipeline(stages=[
    document_assembler, 
    sentence_detector, 
    tokenizer,
    normalizer,
    posTagger,
    chunker,
    finisher
])

CPU times: user 13.9 ms, sys: 3.5 ms, total: 17.4 ms
Wall time: 4.89 s
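
The two patterns passed to `setRegexParsers` are regular expressions over POS tags rather than over characters. The sketch below is a rough, plain-Python approximation of that idea, not Spark NLP's actual implementation: the `chunk` helper and the hand-tagged sample sentence are assumptions made for illustration.

```python
import re


def chunk(tagged_tokens, pattern):
    """Return the token spans whose POS-tag sequence matches `pattern`.

    tagged_tokens: list of (token, pos_tag) pairs for one sentence.
    pattern: chunk grammar over <TAG> symbols, such as "<NNP>+".
    """
    # Encode the tag sequence as one string, e.g. "<DT><JJ><NN>".
    tag_string = "".join("<%s>" % tag for _, tag in tagged_tokens)
    # Wrap every <...> symbol in a non-capturing group so that the quantifiers
    # (?, *, +) apply to whole tags instead of to the closing ">".
    regex = re.sub(r"<([^>]+)>", r"(?:<(?:\1)>)", pattern)
    chunks = []
    for match in re.finditer(regex, tag_string):
        if not match.group(0):
            continue  # skip empty matches from patterns that allow them
        # Every "<" before an offset corresponds to one token before it.
        first = tag_string[:match.start()].count("<")
        last = tag_string[:match.end()].count("<")
        chunks.append(" ".join(token for token, _ in tagged_tokens[first:last]))
    return chunks


# Hypothetical, hand-tagged sentence (not taken from the PubMed sample).
sentence = [("the", "DT"), ("human", "JJ"), ("gene", "NN"),
            ("is", "VBZ"), ("expressed", "VBN")]
print(chunk(sentence, "<DT|PP\\$>?<JJ>*<NN>"))
```

Read this way, `<DT|PP\\$>?<JJ>*<NN>` describes an optional determiner or possessive pronoun, any number of adjectives, and a singular noun, while `<NNP>+` describes a run of proper nouns.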


In [8]:
%%time
nlpPipelineDF = nlpPipeline.fit(pubMedDF).transform(pubMedDF)

CPU times: user 50.7 ms, sys: 15 ms, total: 65.7 ms
Wall time: 646 ms


### Let's create the Spark ML Pipeline

In [11]:
%%time
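# Spark ML Pipeline

# vocabSize=3 keeps only the three most frequent chunks (among those found in
# at least 10 documents), so most rows end up with an empty feature vector.
cv = CountVectorizer(inputCol="finished_chunk", outputCol="features", vocabSize=3, minDF=10.0)
idf = IDF(inputCol="features", outputCol="idf")
# LDA reads its input from the default featuresCol ("features"), i.e. the raw
# counts from CountVectorizer; the "idf" column is computed but not used by LDA.
lda = LDA(k=10, maxIter=10)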
mlPipeline = Pipeline(stages=[
    cv,
    idf,
    lda
])

CPU times: user 4.12 ms, sys: 2.43 ms, total: 6.55 ms
Wall time: 28.3 ms
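
To make the roles of `vocabSize`, `minDF` and `IDF` more concrete, here is a minimal plain-Python sketch of the same ideas. It is an approximation under stated assumptions, not Spark's implementation: the helper names and the toy documents are invented, ties are broken alphabetically, and the IDF formula used is the smoothed form `log((numDocs + 1) / (docFreq + 1))` that Spark ML documents.

```python
import math
from collections import Counter


def fit_vocabulary(docs, vocab_size, min_df):
    """Keep up to `vocab_size` terms that occur in at least `min_df` documents,
    ordered by total frequency (ties broken alphabetically in this sketch)."""
    term_freq = Counter(term for doc in docs for term in doc)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    kept = [term for term in term_freq if doc_freq[term] >= min_df]
    kept.sort(key=lambda term: (-term_freq[term], term))
    return kept[:vocab_size], doc_freq


def count_vector(doc, vocabulary):
    """Count how often each vocabulary term occurs in one document."""
    counts = Counter(doc)
    return [counts[term] for term in vocabulary]


def idf_weights(vocabulary, doc_freq, num_docs):
    """Smoothed inverse document frequency for every vocabulary term."""
    return [math.log((num_docs + 1) / (doc_freq[term] + 1)) for term in vocabulary]


# Toy "finished_chunk" arrays (hypothetical data).
docs = [
    ["the patient", "the study", "the patient"],
    ["the study", "a novel approach"],
    ["the patient", "the results"],
    ["the study", "the patient"],
]
vocabulary, doc_freq = fit_vocabulary(docs, vocab_size=2, min_df=2)
print(vocabulary)
print([count_vector(doc, vocabulary) for doc in docs])
print(idf_weights(vocabulary, doc_freq, len(docs)))
```

A document whose chunks all fall outside the vocabulary gets an all-zero count vector, which is what a sparse vector such as `(3,[],[])` represents.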


### We are going to train the Spark ML Pipeline on the output of the Spark-NLP Pipeline

In [12]:
%%time
mlModel = mlPipeline.fit(nlpPipelineDF)

CPU times: user 223 ms, sys: 190 ms, total: 413 ms
Wall time: 29min 19s


In [23]:
%%time
mlPipelineDF = mlModel.transform(nlpPipelineDF)

CPU times: user 17 ms, sys: 7.83 ms, total: 24.8 ms
Wall time: 145 ms


In [24]:
mlPipelineDF.show()

+--------------------+--------------------+-------------------+--------------------+--------------------+
|                text|      finished_chunk|           features|                 idf|   topicDistribution|
+--------------------+--------------------+-------------------+--------------------+--------------------+
|The human KCNJ9 (...|[The human KCNJ, ...|          (3,[],[])|           (3,[],[])|[0.0,0.0,0.0,0.0,...|
|BACKGROUND: At pr...|[BACKGROUND, the ...|      (3,[0],[2.0])|(3,[0],[5.0816657...|[0.03203122232777...|
|OBJECTIVE: To inv...|[OBJECTIVE, the r...|          (3,[],[])|           (3,[],[])|[0.0,0.0,0.0,0.0,...|
|Combined EEG/fMRI...|[EEG/fMR, recordi...|          (3,[],[])|           (3,[],[])|[0.0,0.0,0.0,0.0,...|
|Kohlschutter synd...|[Kohlschutter, sy...|          (3,[],[])|           (3,[],[])|[0.0,0.0,0.0,0.0,...|
|Statistical analy...|[Statistical anal...|          (3,[],[])|           (3,[],[])|[0.0,0.0,0.0,0.0,...|
|The synthetic DOX...|[The synthetic DO...|   

In [25]:
%%time
ldaModel = mlModel.stages[2]

ll = ldaModel.logLikelihood(mlPipelineDF)
lp = ldaModel.logPerplexity(mlPipelineDF)
print("The lower bound on the log likelihood of the entire corpus: " + str(ll))
print("The upper bound on perplexity: " + str(lp))


The lower bound on the log likelihood of the entire corpus: -6398.161171643155
The upper bound on perplexity: 1.251106994846139
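
The two numbers above are related. Assuming `logPerplexity` is the negated log-likelihood bound divided by the number of tokens in the corpus (which is how Spark ML appears to define it), the token count can be recovered from the printed values. The short check below uses only those two values; the variable names are illustrative.

```python
import math

log_likelihood = -6398.161171643155  # value printed above
log_perplexity = 1.251106994846139   # value printed above

# Assumption: logPerplexity = -logLikelihood / tokenCount.
implied_tokens = -log_likelihood / log_perplexity
print("Implied number of tokens:", round(implied_tokens, 3))

# The printed figure is on a log scale; exponentiating gives a per-token perplexity.
print("Per-token perplexity:", math.exp(log_perplexity))
```

The implied count comes out at about 5114 tokens, a small number for 7537 documents, which is consistent with the vocabulary being limited to three terms.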


In [26]:
%%time
# Describe topics.
topics = ldaModel.describeTopics(3)
print("The topics described by their top-weighted terms:")
topics.show(truncate=False)


The topics described by their top-weighted terms:
+-----+-----------+------------------------------------------------------------------+
|topic|termIndices|termWeights                                                       |
+-----+-----------+------------------------------------------------------------------+
|0    |[2, 0, 1]  |[0.7876887969680446, 0.1346394774230584, 0.07767172560889692]     |
|1    |[1, 2, 0]  |[0.9430415347300795, 0.030109588311868034, 0.026848876958052586]  |
|2    |[2, 0, 1]  |[0.9935477161368587, 0.004148321791835057, 0.0023039620713063338] |
|3    |[0, 2, 1]  |[0.936065526361783, 0.057030930370560715, 0.006903543267656342]   |
|4    |[0, 2, 1]  |[0.35467818364632037, 0.33720574198641157, 0.30811607436726807]   |
|5    |[0, 2, 1]  |[0.3606145766335865, 0.3244837665839638, 0.3149016567824498]      |
|6    |[2, 0, 1]  |[0.3432113103200083, 0.33038981935530154, 0.3263988703246901]     |
|7    |[0, 2, 1]  |[0.41745612904290647, 0.3100640016172902, 0.2724798693398032]
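
The `termIndices` column refers to positions in the vocabulary learned by `CountVectorizer`. In Spark the fitted model exposes it as `vocabulary` (here presumably `mlModel.stages[0].vocabulary`). The sketch below shows the lookup in plain Python; the `label_topic` helper and the placeholder vocabulary are assumptions, since the real vocabulary is not shown in this notebook.

```python
def label_topic(term_indices, term_weights, vocabulary):
    """Pair each term index with its vocabulary entry and its weight."""
    return [(vocabulary[index], weight)
            for index, weight in zip(term_indices, term_weights)]


# Placeholder vocabulary: the real terms are not shown in this notebook.
vocabulary = ["term_0", "term_1", "term_2"]

# Topic 0 from the describeTopics output above.
labelled = label_topic(
    [2, 0, 1],
    [0.7876887969680446, 0.1346394774230584, 0.07767172560889692],
    vocabulary,
)
for term, weight in labelled:
    print(term, round(weight, 4))
```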

In [41]:
# Output topics. Each is a distribution over words (matching word count vectors)
print("Learned topics (as distributions over vocab of " + str(ldaModel.vocabSize())
      + " words):")

# topicsMatrix() returns a vocabSize x k matrix: each row is a vocabulary term
# and each column is a topic, so the matrix is printed once instead of being
# repeated under a "Topic i" label on every loop iteration.
topics = ldaModel.topicsMatrix()
print(topics)


Learned topics (as distributions over vocab of 3 words):
DenseMatrix([[  1.29551777,   0.93421984,   1.35343263, 108.21335739,
                0.73581343,   0.80706986,   0.82055818,   3.00094254,
                1.44194442, 486.82380145],
             [  0.74736699,  32.81359253,   0.75169131,   0.79808045,
                0.63921593,   0.70476251,   0.81064624,   1.95876016,
              409.01028234,  17.72665005],
             [  7.57923943,   1.04767789, 324.15515726,   6.59303038,
                0.6995652 ,   0.72620766,   0.85240172,   2.22893902,
                2.80155065,   2.40319454]])
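
Each column of the topics matrix holds the unnormalized term weights of one topic. For the values printed in this notebook, dividing a column by its sum and sorting reproduces the `termIndices` and `termWeights` reported by `describeTopics`. The sketch below does this in plain Python with the matrix values copied from the output; the `describe_topic` helper is an illustrative name, not a Spark API.

```python
# Rows are the 3 vocabulary terms, columns are the 10 topics
# (values copied from the DenseMatrix printed in this notebook).
topics_matrix = [
    [1.29551777, 0.93421984, 1.35343263, 108.21335739, 0.73581343,
     0.80706986, 0.82055818, 3.00094254, 1.44194442, 486.82380145],
    [0.74736699, 32.81359253, 0.75169131, 0.79808045, 0.63921593,
     0.70476251, 0.81064624, 1.95876016, 409.01028234, 17.72665005],
    [7.57923943, 1.04767789, 324.15515726, 6.59303038, 0.6995652,
     0.72620766, 0.85240172, 2.22893902, 2.80155065, 2.40319454],
]


def describe_topic(matrix, topic, max_terms=3):
    """Return (term_index, normalized_weight) pairs for one topic column,
    sorted by decreasing weight."""
    column = [row[topic] for row in matrix]
    total = sum(column)
    weighted = [(index, value / total) for index, value in enumerate(column)]
    weighted.sort(key=lambda pair: -pair[1])
    return weighted[:max_terms]


for topic in range(3):
    print("Topic", topic, describe_topic(topics_matrix, topic))
```

For topic 0 this gives the term order `[2, 0, 1]` with a leading weight of roughly 0.788, in line with the `describeTopics` table.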