![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

# Spark NLP and Spark ML Pipelines

## Simple Topic Modeling

`Spark-NLP`
* DocumentAssembler
* SentenceDetector
* Tokenizer
* POS tagger
* Chunker
* Finisher

`Spark ML`
* CountVectorizer
* TF-IDF
* LDA

In [1]:
import sys
import time

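from pyspark.sql.functions import col
from pyspark.ml import Pipeline
# Tokenizer is not imported from pyspark.ml.feature: the pipeline below uses Spark NLP's Tokenizer
from pyspark.ml.feature import CountVectorizer, HashingTF, IDF
from pyspark.ml.clustering import LDA, LDAModel

# Spark NLP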
import sparknlp
from sparknlp.pretrained import PretrainedPipeline
from sparknlp.annotator import *
from sparknlp.common import RegexRule
from sparknlp.base import *

### Let's create a Spark Session for our app

In [2]:
spark = sparknlp.start()

print("Spark NLP version: ", sparknlp.version())
print("Apache Spark version: ", spark.version)

Spark NLP version:  2.6.0
Apache Spark version:  2.4.4


Let's download a sample of scientific abstracts from the PubMed dataset:
```
wget -N https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/pubmed/pubmed-sample.csv -P /tmp
```

In [3]:
! wget -N https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/pubmed/pubmed-sample.csv -P /tmp

--2020-09-13 17:31:17--  https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/pubmed/pubmed-sample.csv
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.184.157
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.184.157|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10484510 (10.0M) [text/csv]
Saving to: ‘/tmp/pubmed-sample.csv’


2020-09-13 17:31:20 (4.26 MB/s) - ‘/tmp/pubmed-sample.csv’ saved [10484510/10484510]



In [4]:
pubMedDF = spark.read\
                .option("header", "true")\
                .csv("/tmp/pubmed-sample.csv")\
                .filter("AB IS NOT null")\
                .withColumn("text", col("AB"))\
                .drop("TI", "AB")

In [5]:
pubMedDF.printSchema()
pubMedDF.show()
print('rows', pubMedDF.count())
pubMedDF = pubMedDF.limit(200)  # keep only 200 rows if you are not running on a cluster

root
 |-- text: string (nullable = true)

+--------------------+
|                text|
+--------------------+
|The human KCNJ9 (...|
|BACKGROUND: At pr...|
|OBJECTIVE: To inv...|
|Combined EEG/fMRI...|
|Kohlschutter synd...|
|Statistical analy...|
|The synthetic DOX...|
|Our objective was...|
|We conducted a ph...|
|"Monomeric sarcos...|
|We presented the ...|
|The literature de...|
|A novel approach ...|
|An HPLC-ESI-MS-MS...|
|The localizing an...|
|OBJECTIVE: To eva...|
|For the construct...|
|We report the res...|
|Intraparenchymal ...|
|It is known that ...|
+--------------------+
only showing top 20 rows

rows 7537


### Let's create a Spark NLP pipeline

In [6]:
# Spark NLP Pipeline

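# Turn the raw "text" column into "document" annotations
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence_detector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

# Pretrained part-of-speech tagger; its output column is named "pos"
posTagger = PerceptronModel.pretrained() \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("pos")

# Keep spans whose POS tags match either pattern: runs of proper nouns,
# or an optional determiner, any adjectives, then a noun
chunker = Chunker() \
    .setInputCols(["sentence", "pos"]) \
    .setOutputCol("chunk") \
    .setRegexParsers(["<NNP>+", "<DT>?<JJ>*<NN>"])

# Convert chunk annotations into a plain array-of-strings column ("finished_chunk")
finisher = Finisher() \
    .setInputCols(["chunk"]) \
    .setIncludeMetadata(False)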

nlpPipeline = Pipeline(stages=[
    document_assembler, 
    sentence_detector, 
    tokenizer,
    posTagger,
    chunker,
    finisher
])

pos_anc download started this may take some time.
Approximate size to download 4.3 MB
[OK!]
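
The `Chunker` stage above keeps spans of tokens whose POS tags match patterns such as `<NNP>+` or `<DT>?<JJ>*<NN>`. The idea can be illustrated with a minimal pure-Python sketch. This is not Spark NLP's implementation: the `chunk` helper, the example sentence, and its tags are all hypothetical and only show how a tag pattern selects token spans.

```python
import re

def chunk(tokens, tags, pattern):
    """Return token spans whose POS tags match a <TAG>-style pattern (illustrative only)."""
    # Wrap every <TAG> in a non-capturing group so that ?, * and + apply to whole tags.
    regex = re.compile(
        re.sub(r"<[^>]+>", lambda m: "(?:" + re.escape(m.group(0)) + ")", pattern)
    )
    # Encode the tag sequence as one string, e.g. "<DT><JJ><NN>".
    tag_string = "".join("<" + t + ">" for t in tags)
    # Character offset at which each token's tag starts inside tag_string.
    starts, pos = [], 0
    for t in tags:
        starts.append(pos)
        pos += len(t) + 2
    chunks = []
    for m in regex.finditer(tag_string):
        if m.end() == m.start():
            continue  # skip empty matches
        first = starts.index(m.start())
        last = max(i for i, s in enumerate(starts) if s < m.end())
        chunks.append(" ".join(tokens[first:last + 1]))
    return chunks

# Hypothetical sentence with hand-assigned tags.
tokens = ["The", "human", "KCNJ9", "gene", "is", "a", "novel", "target"]
tags = ["DT", "JJ", "NNP", "NN", "VBZ", "DT", "JJ", "NN"]

print(chunk(tokens, tags, "<NNP>+"))
print(chunk(tokens, tags, "<DT>?<JJ>*<NN>"))
```

Because the closing `>` is part of each tag, `<NN>` does not match `<NNP>`, so "The human" is not merged with "KCNJ9" in this sketch.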


In [7]:
nlpPipelineDF = nlpPipeline.fit(pubMedDF).transform(pubMedDF)

### Let's create a Spark ML pipeline

In [8]:
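# Spark ML Pipeline

# minDF=10.0: keep only terms that appear in at least 10 documents.
# minTF=10.0: within a document, ignore terms that occur fewer than 10 times.
cv = CountVectorizer(inputCol="finished_chunk", outputCol="features", vocabSize=1000, minDF=10.0, minTF=10.0)
idf = IDF(inputCol="features", outputCol="idf")
# LDA reads its input from featuresCol, which defaults to "features" (the raw counts, not the "idf" column).
lda = LDA(k=10, maxIter=5)

mlPipeline = Pipeline(stages=[
    cv,
    idf,
    lda
])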
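
The two `CountVectorizer` thresholds act at different levels: `minDF` decides which terms enter the vocabulary, while `minTF` decides which of those terms are counted inside each individual document. A rough pure-Python sketch of that behaviour follows. The helper names `fit_vocab` and `vectorize` and the toy documents are hypothetical, thresholds are treated as absolute counts only, and ties are broken alphabetically as an assumption.

```python
from collections import Counter

def fit_vocab(docs, min_df, vocab_size):
    """Keep terms found in at least min_df documents, most frequent first."""
    doc_freq, total = Counter(), Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each term once per document
        total.update(doc)          # count every occurrence
    kept = [t for t in total if doc_freq[t] >= min_df]
    kept.sort(key=lambda t: (-total[t], t))
    return kept[:vocab_size]

def vectorize(doc, vocab, min_tf):
    """Sparse count vector as {index: count}; terms below min_tf are ignored."""
    counts = Counter(doc)
    return {i: counts[t] for i, t in enumerate(vocab) if counts[t] >= min_tf}

# Toy corpus of chunk lists (hypothetical data).
docs = [["cell", "protein", "cell"], ["cell", "gene"], ["protein", "cell"]]
vocab = fit_vocab(docs, min_df=2, vocab_size=10)

print(vocab)
print(vectorize(docs[0], vocab, min_tf=1))
print(vectorize(docs[0], vocab, min_tf=10))
```

With `min_tf=10` the toy document has no term that occurs often enough, so its vector is empty. The same effect would be consistent with sparse vectors such as `(39,[],[])` in the `features` column of short abstracts.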

### We are going to train the Spark ML pipeline on the output of the Spark NLP pipeline

In [9]:
# Fit the Spark ML pipeline on the DataFrame produced by the Spark NLP pipeline
mlModel = mlPipeline.fit(nlpPipelineDF)

In [10]:
mlPipelineDF = mlModel.transform(nlpPipelineDF)

In [11]:
mlPipelineDF.show()

+--------------------+--------------------+----------+----------+--------------------+
|                text|      finished_chunk|  features|       idf|   topicDistribution|
+--------------------+--------------------+----------+----------+--------------------+
|The human KCNJ9 (...|[KCNJ9, Kir, GIRK...|(39,[],[])|(39,[],[])|[0.0,0.0,0.0,0.0,...|
|BACKGROUND: At pr...|[BACKGROUND, the ...|(39,[],[])|(39,[],[])|[0.0,0.0,0.0,0.0,...|
|OBJECTIVE: To inv...|[OBJECTIVE, =9796...|(39,[],[])|(39,[],[])|[0.0,0.0,0.0,0.0,...|
|Combined EEG/fMRI...|[Combined EEG/fMR...|(39,[],[])|(39,[],[])|[0.0,0.0,0.0,0.0,...|
|Kohlschutter synd...|[Kohlschutter, sy...|(39,[],[])|(39,[],[])|[0.0,0.0,0.0,0.0,...|
|Statistical analy...|[Statistical, ana...|(39,[],[])|(39,[],[])|[0.0,0.0,0.0,0.0,...|
|The synthetic DOX...|[DOX-LNA, conjuga...|(39,[],[])|(39,[],[])|[0.0,0.0,0.0,0.0,...|
|Our objective was...|[objective, blood...|(39,[],[])|(39,[],[])|[0.0,0.0,0.0,0.0,...|
|We conducted a ph...|[II, a phase, stu...|

In [12]:
ldaModel = mlModel.stages[2]

In [13]:
ll = ldaModel.logLikelihood(mlPipelineDF)
lp = ldaModel.logPerplexity(mlPipelineDF)
print("The lower bound on the log likelihood of the entire corpus: " + str(ll))
print("The upper bound on perplexity: " + str(lp))


The lower bound on the log likelihood of the entire corpus: -473.13472468195687
The upper bound on perplexity: 20.57107498617204
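
The two numbers above are closely related. As far as I understand Spark ML's LDA, `logPerplexity` is the negated log-likelihood bound divided by the total number of tokens in the corpus. Treat that relation as an assumption to check against your Spark version. Under it, the token count that LDA actually saw can be recovered from the printed values:

```python
# Values copied from the notebook output above.
log_likelihood = -473.13472468195687
log_perplexity = 20.57107498617204

# Assumed relation: logPerplexity = -logLikelihood / (number of tokens in the corpus).
implied_token_count = -log_likelihood / log_perplexity

print(round(implied_token_count, 6))
```

If the relation holds, the result of about 23 tokens suggests that very little of the text survived the `CountVectorizer` thresholds, which is worth keeping in mind when reading the topics.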


In [14]:
# Describe topics.
print("The topics described by their top-weighted terms:")
ldaModel.describeTopics(3).show(truncate=False)

The topics described by their top-weighted terms:
+-----+------------+-----------------------------------------------------------------+
|topic|termIndices |termWeights                                                      |
+-----+------------+-----------------------------------------------------------------+
|0    |[31, 27, 9] |[0.03257476024357697, 0.030675255742522122, 0.03046738952388627] |
|1    |[5, 35, 8]  |[0.030694164334629563, 0.029606943258556426, 0.02874382837700391]|
|2    |[9, 17, 1]  |[0.030174246224099154, 0.03008463621531343, 0.029110783343217993]|
|3    |[16, 21, 17]|[0.030326862202922272, 0.03006946863540257, 0.028664585585146864]|
|4    |[11, 34, 17]|[0.0327023724028336, 0.032491287107370766, 0.029777996132834567] |
|5    |[1, 4, 10]  |[0.19069340337059643, 0.027078607193686037, 0.02691452950534889] |
|6    |[20, 29, 32]|[0.03305884263172731, 0.03217891937972598, 0.030787830019083622] |
|7    |[27, 20, 7] |[0.033333923807145116, 0.029616433465517945, 0.0293579930088

### Let's look at our topics
NOTE: More cleaning and filtering, tuning the `CountVectorizer` parameters, and running more iterations in `LDA` will give better topic modeling results.
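
As one possible cleaning step, the sketch below drops chunks that contain no letters (such as `),` or `<`) and chunks that are only section labels of structured abstracts. It is plain Python and not part of the original pipeline: the `clean_chunks` helper and the `SECTION_LABELS` set are hypothetical and would need adapting to your data.

```python
import re

# Hypothetical set of structured-abstract section labels to drop.
SECTION_LABELS = {"BACKGROUND", "OBJECTIVE", "PURPOSE", "METHODS", "CONCLUSION", "CONCLUSIONS"}

def clean_chunks(chunks):
    """Remove punctuation-only chunks and bare section labels (illustrative only)."""
    cleaned = []
    for chunk in chunks:
        text = chunk.strip()
        if not re.search(r"[A-Za-z]", text):
            continue  # no letters at all, e.g. ")," or "<"
        if text in SECTION_LABELS:
            continue  # section label, not a topic term
        cleaned.append(text)
    return cleaned

print(clean_chunks(["OBJECTIVE", "),", "<", "this study", "DNA", " protein "]))
```

In the notebook, a function like this could be applied to the `finished_chunk` column before `CountVectorizer`, for example through a UDF.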

In [15]:
# Output topics. Each is a distribution over words (matching word count vectors)
print("Learned topics (as distributions over vocab of " + str(ldaModel.vocabSize())
      + " words):")

topics = ldaModel.describeTopics(20)
topics_rdd = topics.rdd

vocab = mlModel.stages[0].vocabulary

topics_words = topics_rdd\
       .map(lambda row: row['termIndices'])\
       .map(lambda idx_list: [vocab[idx] for idx in idx_list])\
       .collect()

for idx, topic in enumerate(topics_words):
    print("topic: ", idx)
    print("----------")
    for word in topic:
        print(word)
    print("----------")

Learned topics (as distributions over vocab of 39 words):
topic:  0
----------
therapy
OBJECTIVE
treatment
addition
function
age
group
<
this study
CONCLUSION
protein
factor
DNA
METHODS
time
expression
gene
risk
response
).
----------
topic:  1
----------
protein
family
analysis
),
CONCLUSIONS
group
rate
treatment
response
vitro
function
cell
activity
<
addition
cancer
PURPOSE
factor
CONCLUSION
disease
----------
topic:  2
----------
treatment
function
),
PURPOSE
vivo
vitro
serum
cell
method
CONCLUSIONS
P
time
group
BACKGROUND
disease
CONCLUSION
response
DNA
OBJECTIVE
analysis
----------
topic:  3
----------
activity
serum
function
therapy
vivo
cell
response
vitro
factor
METHODS
OBJECTIVE
),
addition
family
CONCLUSIONS
PURPOSE
treatment
age
P
study
----------
topic:  4
----------
<
contrast
function
this study
analysis
vivo
CONCLUSION
cell
),
family
rate
time
therapy
OBJECTIVE
response
protein
CONCLUSIONS
risk
activity
METHODS
----------
topic:  5
----------
),
cancer
CONCLUSIONS
thera