![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

# Spark NLP and Spark ML Pipelines

## Simple Topic Modeling

`Spark-NLP`
* DocumentAssembler
* SentenceDetector
* Tokenizer
* Normalizer
* POS tagger
* Chunker
* Finisher

`Spark ML`
* CountVectorizer
* TF-IDF
* LDA

In [2]:
import sys
import time

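from pyspark.sql.functions import col
from pyspark.ml import Pipeline
# Tokenizer is not imported from pyspark.ml.feature: the Spark NLP wildcard import below would shadow it
from pyspark.ml.feature import CountVectorizer, HashingTF, IDF
from pyspark.ml.clustering import LDA, LDAModel

# Spark NLP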
import sparknlp
from sparknlp.pretrained import PretrainedPipeline
from sparknlp.annotator import *
from sparknlp.common import RegexRule
from sparknlp.base import *

### Let's create a Spark Session for our app

In [3]:
spark = sparknlp.start()

print("Spark NLP version: ", sparknlp.version())
print("Apache Spark version: ", spark.version)

Spark NLP version:  2.4.2
Apache Spark version:  2.4.4


Let's download a sample of scientific abstracts from the PubMed dataset:
```
wget -N https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/pubmed/pubmed-sample.csv -P /tmp
```

In [4]:
! wget -N https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/pubmed/pubmed-sample.csv -P /tmp

--2020-06-29 16:32:06--  https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/pubmed/pubmed-sample.csv
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.26.158
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.26.158|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10484510 (10.0M) [text/csv]
Saving to: ‘/tmp/pubmed-sample.csv’


2020-06-29 16:32:07 (16.1 MB/s) - ‘/tmp/pubmed-sample.csv’ saved [10484510/10484510]



In [5]:
pubMedDF = spark.read\
                .option("header", "true")\
                .csv("/tmp/pubmed-sample.csv")\
                .filter("AB IS NOT null")\
                .withColumn("text", col("AB"))\
                .drop("TI", "AB")

In [6]:
pubMedDF.printSchema()
pubMedDF.show()
print('rows', pubMedDF.count())
pubMedDF = pubMedDF.limit(200)  # keep only 200 rows if you are not running on a cluster

root
 |-- text: string (nullable = true)

+--------------------+
|                text|
+--------------------+
|The human KCNJ9 (...|
|BACKGROUND: At pr...|
|OBJECTIVE: To inv...|
|Combined EEG/fMRI...|
|Kohlschutter synd...|
|Statistical analy...|
|The synthetic DOX...|
|Our objective was...|
|We conducted a ph...|
|"Monomeric sarcos...|
|We presented the ...|
|The literature de...|
|A novel approach ...|
|An HPLC-ESI-MS-MS...|
|The localizing an...|
|OBJECTIVE: To eva...|
|For the construct...|
|We report the res...|
|Intraparenchymal ...|
|It is known that ...|
+--------------------+
only showing top 20 rows

rows 7537


### Let's create the Spark NLP pipeline

In [7]:
# Spark NLP Pipeline

document_assembler = DocumentAssembler() \
    .setInputCol("text")

sentence_detector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

posTagger = PerceptronModel.pretrained() \
  .setInputCols(["sentence", "token"])

chunker = Chunker() \
    .setInputCols(["sentence", "pos"]) \
    .setOutputCol("chunk") \
    .setRegexParsers(["<NNP>+", "<DT>?<JJ>*<NN>"])

finisher = Finisher() \
  .setInputCols(["chunk"]) \
  .setIncludeMetadata(False)

nlpPipeline = Pipeline(stages=[
    document_assembler, 
    sentence_detector, 
    tokenizer,
    posTagger,
    chunker,
    finisher
])

pos_anc download started this may take some time.
Approximate size to download 4.3 MB
[OK!]
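To make the `setRegexParsers` patterns easier to read, here is a minimal pure-Python sketch of how a pattern such as `<DT>?<JJ>*<NN>` can be matched against a sequence of POS tags. It is an illustration only, not Spark NLP's implementation: the helper `chunk_by_pos`, the sample tokens, and their tags are all hypothetical.

```python
import re

def chunk_by_pos(tokens, tags, pattern):
    """Return token spans whose POS-tag sequence matches a Chunker-style pattern."""
    # Wrap every <TAG> in a non-capturing group so ?, * and + apply to whole tags.
    regex = re.compile(re.sub(r"<([^>]+)>", r"(?:<\1>)", pattern))
    tag_string = ""
    starts = {}  # character offset in tag_string -> token index
    for i, tag in enumerate(tags):
        starts[len(tag_string)] = i
        tag_string += "<" + tag + ">"
    starts[len(tag_string)] = len(tags)
    chunks = []
    for m in regex.finditer(tag_string):
        # Keep only non-empty matches aligned to tag boundaries.
        if m.end() > m.start() and m.start() in starts and m.end() in starts:
            chunks.append(" ".join(tokens[starts[m.start()]:starts[m.end()]]))
    return chunks

# Hypothetical sentence with hand-assigned Penn Treebank tags.
tokens = ["The", "human", "gene", "encodes", "a", "protein"]
tags = ["DT", "JJ", "NN", "VBZ", "DT", "NN"]
print(chunk_by_pos(tokens, tags, "<DT>?<JJ>*<NN>"))
```

In this sketch the pattern picks up an optional determiner, any number of adjectives, and a singular noun, which is why chunks such as "this study" appear later in the topic output. Because `<NN>` is matched literally, plural nouns tagged `NNS` would not close a chunk.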


In [8]:
nlpPipelineDF = nlpPipeline.fit(pubMedDF).transform(pubMedDF)

### Let's create the Spark ML pipeline

In [9]:
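# Spark ML Pipeline

# Note: minTF=10.0 ignores any term that occurs fewer than 10 times within a document,
# and minDF=10.0 drops terms that appear in fewer than 10 documents.
cv = CountVectorizer(inputCol="finished_chunk", outputCol="features", vocabSize=1000, minDF=10.0, minTF=10.0)
idf = IDF(inputCol="features", outputCol="idf")
# LDA reads its default featuresCol, "features" (the raw counts), not the "idf" column.
lda = LDA(k=10, maxIter=5)

# Assemble the Spark ML pipeline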
mlPipeline = Pipeline(stages=[
    cv,
    idf,
    lda
])
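The effect of the `minDF` and `minTF` thresholds is easier to see on a toy corpus. The sketch below mimics the filtering in plain Python, assuming absolute counts only (values of 1 or more); the helpers `fit_vocab` and `vectorize`, the tie-breaking order, and the sample documents are hypothetical and are not the Spark implementation.

```python
from collections import Counter

def fit_vocab(docs, vocab_size, min_df):
    """Keep terms appearing in at least min_df documents, most frequent first."""
    doc_freq = Counter()
    term_freq = Counter()
    for doc in docs:
        term_freq.update(doc)
        doc_freq.update(set(doc))
    kept = [t for t in term_freq if doc_freq[t] >= min_df]
    # Ties are broken alphabetically here; that choice is an assumption of this sketch.
    kept.sort(key=lambda t: (-term_freq[t], t))
    return kept[:vocab_size]

def vectorize(doc, vocab, min_tf):
    """Count in-vocabulary terms, dropping per-document counts below min_tf."""
    counts = Counter(t for t in doc if t in vocab)
    return {t: c for t, c in counts.items() if c >= min_tf}

docs = [
    ["protein", "protein", "protein", "gene"],
    ["protein", "gene", "risk"],
    ["gene", "risk", "risk"],
]
vocab = fit_vocab(docs, vocab_size=1000, min_df=2)
print(vocab)
print([vectorize(d, vocab, min_tf=1) for d in docs])
print([vectorize(d, vocab, min_tf=3) for d in docs])
```

With `min_tf=3` only the first toy document keeps a term and the other two become empty. A high per-document threshold on short abstracts can have the same effect, so lowering `minTF` is one of the first things to try when feature vectors come out empty.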

### We are going to train the Spark ML pipeline on the output of the Spark NLP pipeline

In [10]:
# Fit the Spark ML pipeline on the DataFrame produced by the Spark NLP pipeline
mlModel = mlPipeline.fit(nlpPipelineDF)

In [11]:
mlPipelineDF = mlModel.transform(nlpPipelineDF)

In [12]:
mlPipelineDF.show()

+--------------------+--------------------+----------+----------+--------------------+
|                text|      finished_chunk|  features|       idf|   topicDistribution|
+--------------------+--------------------+----------+----------+--------------------+
|The human KCNJ9 (...|[KCNJ9, Kir, GIRK...|(39,[],[])|(39,[],[])|[0.0,0.0,0.0,0.0,...|
|BACKGROUND: At pr...|[BACKGROUND, the ...|(39,[],[])|(39,[],[])|[0.0,0.0,0.0,0.0,...|
|OBJECTIVE: To inv...|[OBJECTIVE, =9796...|(39,[],[])|(39,[],[])|[0.0,0.0,0.0,0.0,...|
|Combined EEG/fMRI...|[Combined EEG/fMR...|(39,[],[])|(39,[],[])|[0.0,0.0,0.0,0.0,...|
|Kohlschutter synd...|[Kohlschutter, sy...|(39,[],[])|(39,[],[])|[0.0,0.0,0.0,0.0,...|
|Statistical analy...|[Statistical, ana...|(39,[],[])|(39,[],[])|[0.0,0.0,0.0,0.0,...|
|The synthetic DOX...|[DOX-LNA, conjuga...|(39,[],[])|(39,[],[])|[0.0,0.0,0.0,0.0,...|
|Our objective was...|[objective, blood...|(39,[],[])|(39,[],[])|[0.0,0.0,0.0,0.0,...|
|We conducted a ph...|[II, a phase, stu...|

In [13]:
ldaModel = mlModel.stages[2]

In [14]:
ll = ldaModel.logLikelihood(mlPipelineDF)
lp = ldaModel.logPerplexity(mlPipelineDF)
print("The lower bound on the log likelihood of the entire corpus: " + str(ll))
print("The upper bound on perplexity: " + str(lp))


The lower bound on the log likelihood of the entire corpus: -474.6273762902821
The upper bound on perplexity: 20.635972882186177
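The two numbers above are related. To my understanding, Spark computes `logPerplexity` as the negative log-likelihood bound divided by the number of tokens in the dataset; treating that as an assumption, the token count can be recovered from the printed values:

```python
log_likelihood = -474.6273762902821   # value printed above
log_perplexity = 20.635972882186177   # value printed above

# Assuming logPerplexity = -logLikelihood / token_count,
# the token count follows by rearranging the formula.
token_count = -log_likelihood / log_perplexity
print(round(token_count, 6))
```

The result is about 23, which suggests that only around 23 term occurrences survived the `minTF` and `minDF` thresholds across the 200 abstracts, so these scores are computed on very little data.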


In [15]:
# Describe topics.
print("The topics described by their top-weighted terms:")
ldaModel.describeTopics(3).show(truncate=False)

The topics described by their top-weighted terms:
+-----+------------+-----------------------------------------------------------------+
|topic|termIndices |termWeights                                                      |
+-----+------------+-----------------------------------------------------------------+
|0    |[1, 14, 16] |[0.18960784769081807, 0.025568247138736146, 0.025372059866659222]|
|1    |[38, 5, 22] |[0.03254468182866637, 0.031060649278838486, 0.030587731414429147]|
|2    |[13, 10, 26]|[0.032003890160653546, 0.03181408544349651, 0.0302091169229786]  |
|3    |[16, 11, 6] |[0.03042165553411352, 0.02967264902618647, 0.029588840236766063] |
|4    |[32, 21, 19]|[0.03062140328168646, 0.0296858256326627, 0.029651478380734106]  |
|5    |[14, 20, 5] |[0.0314283137127807, 0.030943996152498814, 0.029890273233763708] |
|6    |[14, 2, 25] |[0.030355539329409865, 0.02958853115882957, 0.02892428233776194] |
|7    |[20, 38, 12]|[0.03160353985995292, 0.03039394774029663, 0.030041690192499

### Let's look at our topics
NOTE: More cleaning and filtering of the chunks, tuning the `CountVectorizer` parameters, and running more `LDA` iterations should give better topic modeling results.
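As one example of such cleaning, the sketch below filters a list of chunks in plain Python before they would be counted. The function `clean_chunks`, the `SECTION_LABELS` set, and the `min_len` threshold are hypothetical choices for illustration; in a Spark job the same logic could be applied to the `finished_chunk` column, for instance through a UDF.

```python
import re

# Hypothetical set of structured-abstract labels that carry no topical meaning.
SECTION_LABELS = {"BACKGROUND", "OBJECTIVE", "METHODS", "PURPOSE",
                  "CONCLUSION", "CONCLUSIONS"}

def clean_chunks(chunks, min_len=2):
    """Drop punctuation-only chunks, section labels, and very short chunks."""
    cleaned = []
    for chunk in chunks:
        text = chunk.strip()
        if not re.search(r"[A-Za-z]", text):
            continue  # no letters at all, e.g. "),", ").", "<"
        if text in SECTION_LABELS:
            continue  # upper-case section heading from the abstract
        if len(text) < min_len:
            continue  # single characters such as "P"
        cleaned.append(text.lower())
    return cleaned

print(clean_chunks(["),", "BACKGROUND", "this study", "<", "P", "DNA", ")."]))
```

Lower-casing merges variants such as "DNA" and "dna" into one vocabulary entry, which may or may not be desirable for biomedical text, so treat it as a design choice rather than a requirement.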

In [16]:
# Output topics. Each is a distribution over words (matching word count vectors)
print("Learned topics (as distributions over vocab of " + str(ldaModel.vocabSize())
      + " words):")

topics = ldaModel.describeTopics(20)
topics_rdd = topics.rdd

vocab = mlModel.stages[0].vocabulary

topics_words = topics_rdd\
       .map(lambda row: row['termIndices'])\
       .map(lambda idx_list: [vocab[idx] for idx in idx_list])\
       .collect()

for idx, topic in enumerate(topics_words):
    print("topic: ", idx)
    print("----------")
    for word in topic:
        print(word)
    print("----------")

Learned topics (as distributions over vocab of 39 words):
topic:  0
----------
),
BACKGROUND
activity
age
cell
risk
this study
response
CONCLUSIONS
addition
contrast
disease
function
factor
study
).
DNA
cancer
protein
expression
----------
topic:  1
----------
factor
protein
gene
<
group
this study
contrast
risk
serum
vitro
time
analysis
P
study
CONCLUSION
family
disease
PURPOSE
vivo
treatment
----------
topic:  2
----------
rate
CONCLUSIONS
study
expression
disease
).
therapy
P
vitro
response
serum
method
contrast
time
protein
addition
METHODS
DNA
group
level
----------
topic:  3
----------
activity
<
group
contrast
vitro
vivo
method
cell
).
BACKGROUND
CONCLUSIONS
disease
PURPOSE
CONCLUSION
analysis
factor
age
family
addition
protein
----------
topic:  4
----------
risk
serum
CONCLUSION
this study
contrast
level
),
<
METHODS
treatment
analysis
addition
response
protein
rate
therapy
cancer
P
time
vivo
----------
topic:  5
----------
BACKGROUND
expression
protein
response
cancer
activit