![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

# Spark NLP Basics

## 1. Start Spark Session

In [4]:
import sparknlp

spark = sparknlp.start()

print("Spark NLP version", sparknlp.version())

print("Apache Spark version:", spark.version)


In [5]:
#!pip install spark-nlp==2.4.5

import sparknlp

sparknlp.version()

In [6]:
!pip install spark-nlp==2.4.5

`sparknlp.start()` starts a new `SparkSession`, or returns the existing one, with predefined parameters hardcoded in `spark-nlp/python/sparknlp/__init__.py`. Here is what happens in the background when you run `sparknlp.start()`:

In [8]:
# repo >> spark-nlp/python/sparknlp/__init__.py

from pyspark.sql import SparkSession

def start(gpu=False):
    builder = SparkSession.builder \
        .appName("Spark NLP") \
        .master("local[*]") \
        .config("spark.driver.memory", "16G") \
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
        .config("spark.kryoserializer.buffer.max", "1000M")
    if gpu:
        builder.config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp-gpu_2.11:2.4.5")
    else:
        builder.config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.11:2.4.5")
        
    return builder.getOrCreate()
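
To see which settings differ between the CPU and GPU variants, the builder options in the listing above can be written out as a plain dictionary. This is only a sketch that mirrors the values shown in `start()`; the helper name `session_options` is hypothetical and is not part of the `sparknlp` package, and no Spark session is created.

```python
def session_options(gpu=False, version="2.4.5"):
    """Return the Spark settings from the start() listing as a dict.

    Hypothetical helper for illustration; not part of the sparknlp package.
    """
    artifact = "spark-nlp-gpu_2.11" if gpu else "spark-nlp_2.11"
    return {
        "spark.driver.memory": "16G",
        "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
        "spark.kryoserializer.buffer.max": "1000M",
        "spark.jars.packages": "com.johnsnowlabs.nlp:{}:{}".format(artifact, version),
    }


cpu = session_options()
gpu = session_options(gpu=True)
# Only the Maven coordinate differs between the two variants.
changed = [key for key in cpu if cpu[key] != gpu[key]]
print(changed)
```

Under these assumptions, the `gpu` flag changes only the package coordinate; memory and serializer settings stay the same.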


If you want to start a `SparkSession` with your own parameters, need to load the required jars/packages from local disk, or have no internet connection (which is needed to download the required packages), you can skip `sparknlp.start()` and start the session manually as shown below.

In [10]:
"""
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Spark NLP Enterprise 2.4.5") \
    .master("local[8]") \
    .config("spark.driver.memory","12G") \
    .config("spark.driver.maxResultSize", "2G") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .config("spark.kryoserializer.buffer.max", "800M")\
    .config("spark.jars", "{}spark-nlp-2.4.5.jar,{}spark-nlp-jsl-2.4.5.jar".format(jar_path,jar_path)) \
    .getOrCreate()
    
"""

## 2. Using Pretrained Pipelines

https://github.com/JohnSnowLabs/spark-nlp-models

In [13]:
from sparknlp.pretrained import PretrainedPipeline

In [14]:
testDoc = '''
Peter is a very good persn.
My life in Russia is very intersting.
John and Peter are brothrs. However they don't support each other that much.
Lucas Nogal Dunbercker is no longer happy. He has a good car though.
Europe is very culture rich. There are huge churches! and big houses!
'''

### Explain Document ML

**Stages**
- DocumentAssembler
- SentenceDetector
- Tokenizer
- Lemmatizer
- Stemmer
- Part of Speech
- SpellChecker (Norvig)

In [17]:
pipeline = PretrainedPipeline('explain_document_ml', lang='en')

In [18]:
##%%time
result = pipeline.annotate(testDoc)

In [19]:
result.keys()

In [20]:
result['sentence']

In [21]:
result['token']

In [22]:
list(zip(result['token'], result['pos']))

In [23]:
list(zip(result['token'], result['lemmas'], result['stems'], result['spell']))

In [24]:
import pandas as pd

df = pd.DataFrame({'token':result['token'], 
                      'corrected':result['spell'], 'POS':result['pos'],
                      'lemmas':result['lemmas'], 'stems':result['stems']})
df

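| | token | corrected | POS | lemmas | stems |
|---|---|---|---|---|---|
| 0 | Peter | Peter | NNP | Peter | peter |
| 1 | is | is | VBZ | be | i |
| 2 | a | a | DT | a | a |
| 3 | very | very | RB | very | veri |
| 4 | good | good | JJ | good | good |
| 5 | persn | person | NN | person | person |
| 6 | . | . | . | . | . |
| 7 | My | My | PRP$ | My | my |
| 8 | life | life | NN | life | life |
| 9 | in | in | IN | in | in |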
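
The token-level keys of the `annotate()` result (`token`, `spell`, `pos`, and so on) line up position by position, which is why they can be zipped together. A minimal sketch that lists only the tokens the spell checker changed, using the first six rows of the table above as hard-coded data so that no Spark session is needed:

```python
# First six rows of the table above, copied by hand for illustration.
sample = {
    "token": ["Peter", "is", "a", "very", "good", "persn"],
    "spell": ["Peter", "is", "a", "very", "good", "person"],
}

# Keep only the positions where the corrected form differs from the input token.
corrections = [
    (original, fixed)
    for original, fixed in zip(sample["token"], sample["spell"])
    if original != fixed
]
print(corrections)  # [('persn', 'person')]
```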


### Recognize Entities DL

In [27]:
recognize_entities = PretrainedPipeline('recognize_entities_dl', lang='en')

In [28]:
testDoc = '''
Peter is a very good persn.
My life in Russia is very intersting.
John and Peter are brothrs. However they don't support each other that much.
Lucas Nogal Dunbercker is no longer happy. He has a good car though.
Europe is very culture rich. There are huge churches! and big houses!
'''

result = recognize_entities.annotate(testDoc)

list(zip(result['token'], result['ner']))

### Clean Stop Words

In [30]:
clean_stop = PretrainedPipeline('clean_stop', lang='en')

In [31]:
result = clean_stop.annotate(testDoc)

' '.join(result['cleanTokens'])

### Clean Slang

In [33]:
clean_slang = PretrainedPipeline('clean_slang', lang='en')

result = clean_slang.annotate(' Whatsup bro, call me ASAP')

' '.join(result['normal'])

### Explain Document DL

**Stages**
- DocumentAssembler
- SentenceDetector
- Tokenizer
- NER (NER with GloVe, CoNLL2003 dataset)
- Lemmatizer
- Stemmer
- Part of Speech
- SpellChecker (Norvig)

In [35]:
pipeline_dl = PretrainedPipeline('explain_document_dl', lang='en')


In [36]:
result = pipeline_dl.annotate(testDoc)

result.keys()

In [37]:
result['entities']

In [38]:
df = pd.DataFrame({'token':result['token'], 'ner_label':result['ner'],
                      'spell_corrected':result['checked'], 'POS':result['pos'],
                      'lemmas':result['lemma'], 'stems':result['stem']})

df

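| | token | ner_label | spell_corrected | POS | lemmas | stems |
|---|---|---|---|---|---|---|
| 0 | Peter | B-PER | Peter | NNP | Peter | peter |
| 1 | is | O | is | VBZ | be | i |
| 2 | a | O | a | DT | a | a |
| 3 | very | O | very | RB | very | veri |
| 4 | good | O | good | JJ | good | good |
| 5 | persn | O | person | NN | person | person |
| 6 | . | O | . | . | . | . |
| 7 | My | O | My | PRP$ | My | my |
| 8 | life | O | life | NN | life | life |
| 9 | in | O | in | IN | in | in |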


### Spell Checker

In [40]:
spell_checker = PretrainedPipeline('check_spelling', lang='en')


In [41]:
result = spell_checker.annotate(testDoc)

result.keys()

In [42]:
list(zip(result['token'], result['checked']))

### Spell Checker DL
https://medium.com/spark-nlp/applying-context-aware-spell-checking-in-spark-nlp-3c29c46963bc

In [44]:
spell_checker_dl = PretrainedPipeline('check_spelling_dl', lang='en')

In [45]:
text = 'We will go to swimming if the ueather is nice.'

result = spell_checker_dl.annotate(text)

list(zip(result['token'], result['checked']))

In [46]:
result.keys()

In [47]:
# check for the different occurrences of the word "ueather"
examples = ['We will go to swimming if the ueather is nice.',\
    "I have a black ueather jacket, so nice.",\
    "I introduce you to my sister, she is called ueather."]

results = spell_checker_dl.annotate(examples)

for result in results:
  print (list(zip(result['token'], result['checked'])))

In [48]:
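for result in results:
    # keep only the (token, correction) pairs where the checker changed the token
    changed = [(token, checked) for token, checked in zip(result['token'], result['checked']) if token != checked]
    print(result['document'], '>>', changed)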

In [49]:
# if we had tried the same with spell_checker (previous version)

results = spell_checker.annotate(examples)

for result in results:
  print (list(zip(result['token'], result['checked'])))

### Parsing a list of texts

In [51]:
testDoc_list = ['French author who helped pioner the science-fiction genre.',
'Verne wrate about space, air, and underwater travel before navigable aircrast',
'Practical submarines were invented, and before any means of space travel had been devised.']

testDoc_list

In [52]:
result_list = pipeline.annotate(testDoc_list)

len (result_list)

In [53]:
result_list[0]

### Using fullAnnotate to get more details

In [55]:
text = 'Peter Parker is a nice guy and lives in New York'

In [56]:
# pipeline_dl >> explain_document_dl

detailed_result = pipeline_dl.fullAnnotate(text)

In [57]:
detailed_result

In [59]:
detailed_result[0]['entities']

In [60]:
chunks = []
entities = []
for chunk in detailed_result[0]['entities']:
    # each annotation carries the chunk text and its entity label in the metadata
    chunks.append(chunk.result)
    entities.append(chunk.metadata['entity'])

df = pd.DataFrame({'chunks': chunks, 'entities': entities})
df

| | chunks | entities |
|---|---|---|
| 0 | Peter Parker | PER |
| 1 | New York | LOC |
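
The chunks above correspond to runs of token-level IOB tags: a `B-` tag opens an entity and the following `I-` tags with the same label continue it. The sketch below shows one way such tags could be merged into chunks in plain Python. It is an illustration only, not the pipeline's own conversion logic, and the `I-LOC` tag for "York" is an assumption inferred from the `New York` chunk.

```python
def group_iob(tokens, tags):
    """Merge B-/I- tagged tokens into (chunk_text, label) pairs."""
    chunks = []
    current_tokens, current_label = [], None
    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        if prefix == "B" or (prefix == "I" and label != current_label):
            # A new entity starts: flush the one in progress, if any.
            if current_tokens:
                chunks.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [token], label
        elif prefix == "I":
            current_tokens.append(token)
        else:
            # "O" closes any open entity.
            if current_tokens:
                chunks.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [], None
    if current_tokens:
        chunks.append((" ".join(current_tokens), current_label))
    return chunks


tokens = "Peter Parker is a nice guy and lives in New York".split()
# The tag for "York" is assumed (I-LOC); the others follow this notebook's output.
tags = ["B-PER", "I-PER", "O", "O", "O", "O", "O", "O", "O", "B-LOC", "I-LOC"]
print(group_iob(tokens, tags))
```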


In [61]:
tuples = []

for x,y,z in zip(detailed_result[0]["token"], detailed_result[0]["pos"], detailed_result[0]["ner"]):

  tuples.append((int(x.metadata['sentence']), x.result, x.begin, x.end, y.result, z.result))

df = pd.DataFrame(tuples, columns=['sent_id','token','start','end','pos', 'ner'])

df


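| | sent_id | token | start | end | pos | ner |
|---|---|---|---|---|---|---|
| 0 | 0 | Peter | 0 | 4 | NNP | B-PER |
| 1 | 0 | Parker | 6 | 11 | NNP | I-PER |
| 2 | 0 | is | 13 | 14 | VBZ | O |
| 3 | 0 | a | 16 | 16 | DT | O |
| 4 | 0 | nice | 18 | 21 | JJ | O |
| 5 | 0 | guy | 23 | 25 | NN | O |
| 6 | 0 | and | 27 | 29 | CC | O |
| 7 | 0 | lives | 31 | 35 | NNS | O |
| 8 | 0 | in | 37 | 38 | IN | O |
| 9 | 0 | New | 40 | 42 | NNP | B-LOC |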
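
The `start` and `end` columns are character offsets into the input text, and `end` is inclusive (the single-character token `a` spans 16 to 16). A small sketch that checks this against the original sentence; the `Token` namedtuple is a stand-in for the annotation objects and models only the fields used here.

```python
from collections import namedtuple

# Stand-in for the annotation objects; only the fields used here are modeled.
Token = namedtuple("Token", ["result", "begin", "end"])

text = "Peter Parker is a nice guy and lives in New York"
# Offsets copied from the first rows of the table above.
tokens = [
    Token("Peter", 0, 4),
    Token("Parker", 6, 11),
    Token("is", 13, 14),
    Token("a", 16, 16),
]

for tok in tokens:
    # `end` is inclusive, so the slice must run to end + 1.
    assert text[tok.begin:tok.end + 1] == tok.result

lengths = [tok.end - tok.begin + 1 for tok in tokens]
print(lengths)  # [5, 6, 2, 1]
```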


### Use pretrained match_chunk Pipeline for Individual Noun Phrase

**Stages**
- DocumentAssembler
- SentenceDetector
- Tokenizer
- Part of Speech
- Chunker

Pipeline:

- The pipeline uses the regex `<DT>?<JJ>*<NN>+` over part-of-speech tags.
- Whenever the chunker finds an optional determiner (DT) followed by any number of adjectives (JJ) and then one or more nouns (NN), it forms a noun phrase (NP) chunk.
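
The tag pattern can be tried out in plain Python by encoding the tag sequence as a string and running an ordinary regular expression over it. This is only a sketch of the idea, not how the Spark NLP chunker is implemented; the function name `chunk_noun_phrases` is hypothetical and the POS tags for the sample sentence are assumed by hand rather than produced by the pipeline.

```python
import re


def chunk_noun_phrases(tokens, tags, pattern=r"(?:<DT>)?(?:<JJ>)*(?:<NN>)+"):
    """Return the token spans whose tag sequence matches the pattern."""
    # Encode the tag sequence as one string, e.g. "<DT><JJ><NN>".
    encoded = "".join("<{}>".format(tag) for tag in tags)
    # Character offset at which each token's tag starts in the encoded string.
    starts = []
    position = 0
    for tag in tags:
        starts.append(position)
        position += len(tag) + 2  # tag plus the two angle brackets
    chunks = []
    for match in re.finditer(pattern, encoded):
        first = starts.index(match.start())
        last = max(i for i, start in enumerate(starts) if start < match.end())
        chunks.append(" ".join(tokens[first:last + 1]))
    return chunks


tokens = "the little yellow dog barked at the cat".split()
# Hand-assigned tags for illustration (assumed, not pipeline output).
tags = ["DT", "JJ", "JJ", "NN", "VBD", "IN", "DT", "NN"]
print(chunk_noun_phrases(tokens, tags))  # ['the little yellow dog', 'the cat']
```

Because the closing `>` is part of each pattern element, `<NN>` does not match a plural noun tagged `NNS`.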

In [64]:
pipeline = PretrainedPipeline('match_chunks', lang='en')


In [65]:
result = pipeline.annotate("The book has many chapters") # single noun phrase


In [66]:
result

In [67]:
result['chunk']

In [68]:
result = pipeline.annotate("the little yellow dog barked at the cat") # multiple noun phrases

In [69]:
result

In [70]:
result['chunk']

### Extract exact dates from referential date phrases

In [73]:
pipeline = PretrainedPipeline('match_datetime', lang='en')


In [74]:
result = pipeline.annotate("I saw him yesterday and he told me that he will visit us next week")

result

In [75]:
pipeline.fullAnnotate("I saw him yesterday and he told me that he will visit us next week")
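
Resolving a referential phrase means turning it into a concrete date relative to some reference day. The sketch below illustrates that idea with the standard library only. It is not the matching logic of the pretrained pipeline; the phrase table, the function name, and the fixed reference date are all assumptions made for the example.

```python
from datetime import date, timedelta

# Offsets in days for a few relative phrases; the phrase list is illustrative only.
RELATIVE_OFFSETS = {
    "yesterday": -1,
    "today": 0,
    "tomorrow": 1,
    "next week": 7,
}


def resolve_relative_dates(text, reference):
    """Return (phrase, resolved_date) pairs for the known phrases found in text."""
    lowered = text.lower()
    found = []
    for phrase, days in RELATIVE_OFFSETS.items():
        index = lowered.find(phrase)
        if index != -1:
            found.append((index, phrase, reference + timedelta(days=days)))
    # Report matches in the order they appear in the sentence.
    return [(phrase, resolved) for _, phrase, resolved in sorted(found)]


sentence = "I saw him yesterday and he told me that he will visit us next week"
reference = date(2020, 4, 15)  # fixed reference date, chosen arbitrarily
for phrase, resolved in resolve_relative_dates(sentence, reference):
    print(phrase, "->", resolved.isoformat())
```

With the reference date fixed this way, the output is reproducible; a real run would resolve the phrases against the day the pipeline is executed.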

In [76]:
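# Note: `detailed_result` still holds the explain_document_dl output for the
# 'Peter Parker' sentence from earlier; the fullAnnotate call above was not assigned to it.
tuples = []

for x in detailed_result[0]["token"]:
    tuples.append((int(x.metadata['sentence']), x.result, x.begin, x.end))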

df = pd.DataFrame(tuples, columns=['sent_id','token','start','end'])

df

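| | sent_id | token | start | end |
|---|---|---|---|---|
| 0 | 0 | Peter | 0 | 4 |
| 1 | 0 | Parker | 6 | 11 |
| 2 | 0 | is | 13 | 14 |
| 3 | 0 | a | 16 | 16 |
| 4 | 0 | nice | 18 | 21 |
| 5 | 0 | guy | 23 | 25 |
| 6 | 0 | and | 27 | 29 |
| 7 | 0 | lives | 31 | 35 |
| 8 | 0 | in | 37 | 38 |
| 9 | 0 | New | 40 | 42 |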


### Sentiment Analysis
#### Vivekn algorithm

In [78]:
pipeline = PretrainedPipeline('analyze_sentiment', lang='en')

In [79]:
result = pipeline.annotate("The movie I watched today was not a good one")

result['sentiment']

#### DL version (trained on IMDb)

In [81]:
sentiment_imdb = PretrainedPipeline('analyze_sentimentdl_use_imdb', lang='en')

In [82]:
sentiment_imdb_glove = PretrainedPipeline('analyze_sentimentdl_glove_imdb', lang='en')

In [83]:
comment = '''
It's a very scary film but what impressed me was how true the film sticks to the original's tricks; it isn't filled with loud in-your-face jump scares, in fact, a lot of what makes this film scary is the slick cinematography and intricate shadow play. The use of lighting and creation of atmosphere is what makes this film so tense, which is why it's perfectly suited for those who like Horror movies but without the obnoxious gore.
'''
result = sentiment_imdb_glove.annotate(comment)

result['sentiment']

In [84]:
sentiment_imdb_glove.fullAnnotate(comment)[0]['sentiment']

#### DL version (trained on Twitter dataset)

In [86]:
sentiment_twitter = PretrainedPipeline('analyze_sentimentdl_use_twitter', lang='en')

In [87]:
result = sentiment_twitter.annotate("The movie I watched today was not a good one")

result['sentiment']