![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

# Spark NLP and Spark ML Pipelines

## Simple Topic Modeling

`Spark-NLP`
* DocumentAssembler
* SentenceDetector
* Tokenizer
* Normalizer
* POS tagger
* Chunker
* Finisher

`Spark ML`
* CountVectorizer
* IDF
* LDA

In [1]:
import sys
import time

from pyspark.sql.functions import col
from pyspark.ml import Pipeline
from pyspark.ml.feature import CountVectorizer, HashingTF, IDF
from pyspark.ml.clustering import LDA, LDAModel

# Spark NLP (its Tokenizer annotator is the one used below)
import sparknlp
from sparknlp.pretrained import PretrainedPipeline
from sparknlp.annotator import *
from sparknlp.common import RegexRule
from sparknlp.base import *

### Let's create a Spark Session for our app

In [2]:
spark = sparknlp.start()

print("Spark NLP version")
sparknlp.version()
print("Apache Spark version")
spark.version

Spark NLP version
2.2.0
Apache Spark version


'2.4.3'

Let's download a sample of scientific abstracts from the PubMed dataset:
```
wget -N 	https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/pubmed/pubmed-sample.csv -P /tmp
```

In [3]:
! wget -N 	https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/pubmed/pubmed-sample.csv -P /tmp

--2019-09-10 15:25:20--  https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/pubmed/pubmed-sample.csv
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.186.5
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.186.5|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10484510 (10.0M) [text/csv]
Saving to: ‘/tmp/pubmed-sample.csv’


2019-09-10 15:25:23 (4.80 MB/s) - ‘/tmp/pubmed-sample.csv’ saved [10484510/10484510]



In [4]:
pubMedDF = spark.read\
                .option("header", "true")\
                .csv("/tmp/pubmed-sample.csv")\
                .filter("AB IS NOT null")\
                .withColumn("text", col("AB"))\
                .drop("TI", "AB")

In [5]:
pubMedDF.printSchema()
pubMedDF.show()

root
 |-- text: string (nullable = true)

+--------------------+
|                text|
+--------------------+
|The human KCNJ9 (...|
|BACKGROUND: At pr...|
|OBJECTIVE: To inv...|
|Combined EEG/fMRI...|
|Kohlschutter synd...|
|Statistical analy...|
|The synthetic DOX...|
|Our objective was...|
|We conducted a ph...|
|"Monomeric sarcos...|
|We presented the ...|
|The literature de...|
|A novel approach ...|
|An HPLC-ESI-MS-MS...|
|The localizing an...|
|OBJECTIVE: To eva...|
|For the construct...|
|We report the res...|
|Intraparenchymal ...|
|It is known that ...|
+--------------------+
only showing top 20 rows



In [6]:
pubMedDF.count()

7537

### Let's create Spark-NLP Pipeline

In [7]:
# Spark NLP Pipeline

document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence_detector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

posTagger = PerceptronModel.pretrained() \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("pos")

chunker = Chunker() \
    .setInputCols(["sentence", "pos"]) \
    .setOutputCol("chunk") \
    .setRegexParsers(["<NNP>+", "<DT>?<JJ>*<NN>"])

finisher = Finisher() \
  .setInputCols(["chunk"]) \
  .setIncludeMetadata(False)

nlpPipeline = Pipeline(stages=[
    document_assembler, 
    sentence_detector, 
    tokenizer,
    posTagger,
    chunker,
    finisher
])

pos_anc download started this may take some time.
Approximate size to download 4.3 MB
[OK!]
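
To make the two chunk patterns above more concrete, here is a minimal pure-Python sketch of how tag patterns such as `<NNP>+` and `<DT>?<JJ>*<NN>` can be matched against a sequence of POS tags. It is not Spark NLP's implementation: the helper names `compile_tag_pattern` and `chunk_by_pos` are hypothetical, and the POS tags in the example are assigned by hand for illustration rather than produced by the pretrained tagger.

```python
import re


def compile_tag_pattern(pattern):
    # Wrap every <TAG> in a non-capturing group so that quantifiers such as
    # "+" or "?" apply to the whole tag rather than to the closing ">".
    grouped = re.sub(
        r"<([^>]+)>",
        lambda m: "(?:<" + re.escape(m.group(1)) + ">)",
        pattern,
    )
    return re.compile(grouped)


def chunk_by_pos(tokens, tags, patterns):
    """Return the token spans whose POS tag sequence matches a pattern."""
    # Encode the tag sequence as one string, e.g. "<DT><JJ><NN>".
    tag_string = "".join("<" + tag + ">" for tag in tags)

    # Map the character offset of each tag back to its token index.
    offset_to_index = {}
    offset = 0
    for i, tag in enumerate(tags):
        offset_to_index[offset] = i
        offset += len(tag) + 2  # two extra characters for "<" and ">"

    chunks = []
    for pattern in patterns:
        for match in compile_tag_pattern(pattern).finditer(tag_string):
            first = offset_to_index[match.start()]
            n_tags = match.group(0).count("<")
            chunks.append(" ".join(tokens[first:first + n_tags]))
    return chunks


# Hand-assigned tags, for illustration only.
tokens = ["The", "human", "KCNJ9", "gene", "encodes", "a", "potassium", "channel"]
tags = ["DT", "JJ", "NNP", "NN", "VBZ", "DT", "NN", "NN"]

chunks = chunk_by_pos(tokens, tags, ["<NNP>+", "<DT>?<JJ>*<NN>"])
print(chunks)
```

In this sketch each pattern is matched independently, and `<NN>` does not match `NNS` or `NNP` because the closing `>` is part of the pattern.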


In [8]:
nlpPipelineDF = nlpPipeline.fit(pubMedDF).transform(pubMedDF)

### Let's create Spark ML Pipeline

In [9]:
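# Spark ML Pipeline

# Note: minTF=10.0 keeps a term only if it occurs at least 10 times within a single document,
# so most of the feature vectors in the output further down are empty.
cv = CountVectorizer(inputCol="finished_chunk", outputCol="features", vocabSize=1000, minDF=10.0, minTF=10.0)
idf = IDF(inputCol="features", outputCol="idf")
# Note: LDA reads its default featuresCol ("features"), i.e. the raw counts, not the "idf" column.
lda = LDA(k=10, maxIter=5)

mlPipeline = Pipeline(stages=[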
    cv,
    idf,
    lda
])
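
The effect of the `minDF`, `minTF` and `vocabSize` settings can be illustrated without Spark. The sketch below is a simplified pure-Python stand-in, not Spark's `CountVectorizer`: the function name `count_vectorize` is hypothetical, thresholds are treated as absolute counts (as Spark does for values of 1 or more), and the alphabetical tie-breaking in the vocabulary order is an assumption.

```python
from collections import Counter


def count_vectorize(docs, min_df, min_tf, vocab_size):
    """Toy count vectorizer returning (vocabulary, sparse count vectors)."""
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    # Corpus frequency: used to rank terms when the vocabulary is capped.
    corpus_freq = Counter(term for doc in docs for term in doc)

    # min_df filters the vocabulary; the most frequent terms are kept first.
    kept = [t for t in corpus_freq if doc_freq[t] >= min_df]
    kept.sort(key=lambda t: (-corpus_freq[t], t))
    vocab = kept[:vocab_size]
    index = {term: i for i, term in enumerate(vocab)}

    # min_tf is applied per document when the vectors are built.
    vectors = []
    for doc in docs:
        counts = Counter(t for t in doc if t in index)
        vectors.append({index[t]: c for t, c in counts.items() if c >= min_tf})
    return vocab, vectors


docs = [
    ["dose", "glucose", "dose"],
    ["dose", "patient"],
    ["glucose", "dose", "dose", "dose"],
]

vocab, loose = count_vectorize(docs, min_df=2, min_tf=1, vocab_size=1000)
_, strict = count_vectorize(docs, min_df=2, min_tf=3, vocab_size=1000)

print(vocab)   # terms that appear in at least two documents
print(loose)   # every in-vocabulary term is counted
print(strict)  # only terms repeated at least 3 times in a document survive
```

With the stricter per-document threshold most vectors come out empty, which is the same pattern as the `(1000,[],[])` feature vectors in the pipeline output.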

### We are going to train the Spark ML Pipeline on the output of the Spark-NLP Pipeline

In [10]:
# Fit the Spark ML Pipeline on the DataFrame produced by the Spark-NLP Pipeline
mlModel = mlPipeline.fit(nlpPipelineDF)

In [11]:
mlPipelineDF = mlModel.transform(nlpPipelineDF)

In [12]:
mlPipelineDF.show()

+--------------------+--------------------+------------+------------+--------------------+
|                text|      finished_chunk|    features|         idf|   topicDistribution|
+--------------------+--------------------+------------+------------+--------------------+
|The human KCNJ9 (...|[KCNJ9, Kir, GIRK...|(1000,[],[])|(1000,[],[])|[0.0,0.0,0.0,0.0,...|
|BACKGROUND: At pr...|[BACKGROUND, the ...|(1000,[],[])|(1000,[],[])|[0.0,0.0,0.0,0.0,...|
|OBJECTIVE: To inv...|[OBJECTIVE, =9796...|(1000,[],[])|(1000,[],[])|[0.0,0.0,0.0,0.0,...|
|Combined EEG/fMRI...|[Combined EEG/fMR...|(1000,[],[])|(1000,[],[])|[0.0,0.0,0.0,0.0,...|
|Kohlschutter synd...|[Kohlschutter, sy...|(1000,[],[])|(1000,[],[])|[0.0,0.0,0.0,0.0,...|
|Statistical analy...|[Statistical, ana...|(1000,[],[])|(1000,[],[])|[0.0,0.0,0.0,0.0,...|
|The synthetic DOX...|[DOX-LNA, conjuga...|(1000,[],[])|(1000,[],[])|[0.0,0.0,0.0,0.0,...|
|Our objective was...|[objective, blood...|(1000,[],[])|(1000,[],[])|[0.0,0.0,0.0,0.0,...|

In [13]:
ldaModel = mlModel.stages[2]

In [14]:
ll = ldaModel.logLikelihood(mlPipelineDF)
lp = ldaModel.logPerplexity(mlPipelineDF)
print("The lower bound on the log likelihood of the entire corpus: " + str(ll))
print("The upper bound on perplexity: " + str(lp))


The lower bound on the log likelihood of the entire corpus: -19456.289390402628
The upper bound on perplexity: 12.708222985240123
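
The two numbers above are related. Assuming Spark's definition, in which the log perplexity is the negative log-likelihood bound divided by the total number of tokens in the feature vectors, the token count can be recovered with a little arithmetic. The variable names below are illustrative, and the values are copied from the outputs printed in this notebook.

```python
# Values copied from the notebook output above.
log_likelihood = -19456.289390402628
log_perplexity = 12.708222985240123
doc_count = 7537  # from pubMedDF.count()

# Under the assumed definition: log_perplexity = -log_likelihood / token_count
token_count = -log_likelihood / log_perplexity

print("Implied token count:", round(token_count))
print("Tokens per document:", round(token_count / doc_count, 2))
```

The implied token count is far smaller than the number of documents, which is consistent with most feature vectors being empty after the `minTF=10.0` filter.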


In [15]:
# Describe topics.
print("The topics described by their top-weighted terms:")
ldaModel.describeTopics(3).show(truncate=False)

The topics described by their top-weighted terms:
+-----+---------------+---------------------------------------------------------------------+
|topic|termIndices    |termWeights                                                          |
+-----+---------------+---------------------------------------------------------------------+
|0    |[129, 85, 162] |[0.009857518024483591, 0.007910888626305495, 0.0013025172935866678]  |
|1    |[300, 927, 770]|[0.007935607563269446, 0.0074662423167941964, 0.0013577441277888302] |
|2    |[3, 6, 0]      |[0.042247589630979306, 0.02338549033457705, 0.021292477055820495]    |
|3    |[19, 12, 645]  |[0.008850993650447876, 0.008838468891365777, 0.0013980774883356314]  |
|4    |[22, 134, 57]  |[0.027669589971554293, 0.013629145818395184, 0.007339287603627411]   |
|5    |[51, 537, 367] |[0.008808355396129218, 0.0075699921801503384, 0.0013106696237524754] |
|6    |[774, 384, 332]|[0.009588666613626525, 0.008160356153952601, 0.007368502499050751]   |
|7    |[1,

### Let's look at our topics
NOTE: More cleaning and filtering, tuning the `CountVectorizer` parameters, and running more `LDA` iterations will give better topic modeling results.
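
As one example of the kind of cleaning meant here, the topic lists contain near-duplicates such as `a patient` / `the patient` and `The method` / `the method`, as well as punctuation-only chunks such as `>` and `+/`. The sketch below shows one possible normalization in plain Python. The function names and the determiner list are hypothetical choices, not part of Spark NLP; in the pipeline such logic would have to be applied to the `finished_chunk` column, for example through a UDF.

```python
import re

# Hypothetical list of leading words to strip; adjust to the corpus.
DETERMINERS = {"a", "an", "the", "this", "that", "these", "those"}


def clean_chunk(chunk):
    """Lowercase a chunk, strip leading determiners, drop letterless chunks."""
    words = chunk.lower().split()
    while words and words[0] in DETERMINERS:
        words = words[1:]
    text = " ".join(words)
    # Chunks without any letter (e.g. ">" or "+/") carry no topical meaning.
    if not re.search(r"[a-z]", text):
        return None
    return text


def clean_chunks(chunks):
    cleaned = (clean_chunk(c) for c in chunks)
    return [c for c in cleaned if c is not None]


sample = ["a patient", "the patient", "The method", "the method", ">", "+/", ").", "PCR"]
cleaned_sample = clean_chunks(sample)
print(cleaned_sample)
```

Lowercasing also folds acronyms such as `PCR` into `pcr`; whether that is desirable depends on the corpus.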

In [16]:
# Output topics. Each is a distribution over words (matching word count vectors)
print("Learned topics (as distributions over vocab of " + str(ldaModel.vocabSize())
      + " words):")

topics = ldaModel.describeTopics(50)
topics_rdd = topics.rdd

vocab = mlModel.stages[0].vocabulary

topics_words = topics_rdd\
       .map(lambda row: row['termIndices'])\
       .map(lambda idx_list: [vocab[idx] for idx in idx_list])\
       .collect()

for idx, topic in enumerate(topics_words):
    print("topic: ", idx)
    print("----------")
    for word in topic:
        print(word)
    print("----------")

Learned topics (as distributions over vocab of 1000 words):
topic:  0
----------
dose
glucose
increase
platelet
length
a combination
side
degradation
ethanol
conclusion
reaction
a patient
performance
myocardial
PCR
model
Doppler
safety
epilepsy
this review
presentation
electron
health
resistance
prevalence
AR
MR
neuronal
course
family
transfer
MS
restriction
matrix
evaluation
emission
ability
ischemia
monitoring
account
CF
implantation
surgery
acetate
outcome
the patient
validity
a reduction
cortex
correlation
----------
topic:  1
----------
cm
DCs
stimuli
energy
choice
injection
factor
neck
food
Type
the study
gradient
milk
class
ratio
the method
no significant difference
animal
IL-10
accumulation
kDa
a variety
>
amount
receptor
The method
pressure
heat
peptide
hydrogen
i.e
HCV
gender
Data
incidence
Doppler
utilization
test
adhesion
virus
model
patientswith
enhancement
location
the type
sperm
outcome
the range
the frequency
secretion
----------
topic:  2
----------
group
+/
).
P
HF
di