![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

# 1. Quickstart Tutorial on Spark NLP - 1 hr

This is the 1-hour workshop version of the full set of training notebooks: https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/tutorials/Certification_Trainings/Public

An intro article for Spark NLP:

https://towardsdatascience.com/introduction-to-spark-nlp-foundations-and-basic-components-part-i-c83b7629ed59

How to get started with Spark NLP in 2 weeks:

https://towardsdatascience.com/how-to-get-started-with-sparknlp-in-2-weeks-cb47b2ba994d

https://towardsdatascience.com/how-to-wrap-your-head-around-spark-nlp-a6f6a968b7e8

Articles on NER and text classification in Spark NLP:

https://towardsdatascience.com/named-entity-recognition-ner-with-bert-in-spark-nlp-874df20d1d77

https://medium.com/spark-nlp/named-entity-recognition-for-healthcare-with-sparknlp-nerdl-and-nercrf-a7751b6ad571

https://towardsdatascience.com/text-classification-in-spark-nlp-with-bert-and-universal-sentence-encoders-e644d618ca32

A webinar showing how to train a NER model from scratch (90 min):

https://www.youtube.com/watch?v=djWX0MR2Ooo

Workshop repo where you can start playing with Spark NLP in Colab:

https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/tutorials/Certification_Trainings

Databricks Notebooks:

https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/products/databricks

## Coding ...

In [0]:
import sparknlp

from sparknlp.base import *
from sparknlp.annotator import *

from pyspark.ml import Pipeline

print("Spark NLP version", sparknlp.version())

spark

Spark NLP version 4.3.1


## Using Pretrained Pipelines

For a more detailed notebook, see https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/1.SparkNLP_Basics.ipynb

In [0]:
from sparknlp.pretrained import PretrainedPipeline

pipeline_dl = PretrainedPipeline('explain_document_dl', lang='en')


explain_document_dl download started this may take some time.
Approx size to download 169.4 MB
[ | ][OK!]


**Stages**
- DocumentAssembler
- SentenceDetector
- Tokenizer
- NER (NER with GloVe 100D embeddings, CoNLL2003 dataset)
- Lemmatizer
- Stemmer
- Part of Speech
- SpellChecker (Norvig)

In [0]:
testDoc = '''
Peter Parker is a very good persn.
My life in Russia is very intersting.
John and Peter are brthers. However they don't support each other that much.
Mercedes Benz is also working on a driverless car.
Europe is very culture rich. There are huge churches! and big houses!
'''

result = pipeline_dl.annotate(testDoc)


In [0]:
result.keys()

Out[4]: dict_keys(['entities', 'stem', 'checked', 'lemma', 'document', 'pos', 'token', 'ner', 'embeddings', 'sentence'])

In [0]:
result['entities']

Out[5]: ['Peter Parker', 'Russia', 'John', 'Peter', 'Mercedes Benz', 'Europe']

In [0]:
import pandas as pd

df = pd.DataFrame({'token':result['token'], 'ner_label':result['ner'],
                      'spell_corrected':result['checked'], 'POS':result['pos'],
                      'lemmas':result['lemma'], 'stems':result['stem']})

df



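| | token | ner_label | spell_corrected | POS | lemmas | stems |
|---|---|---|---|---|---|---|
| 0 | Peter | B-PER | Peter | NNP | Peter | peter |
| 1 | Parker | I-PER | Parker | NNP | Parker | parker |
| 2 | is | O | is | VBZ | be | i |
| 3 | a | O | a | DT | a | a |
| 4 | very | O | very | RB | very | veri |
| 5 | good | O | good | JJ | good | good |
| 6 | persn | O | person | NN | person | person |
| 7 | . | O | . | . | . | . |
| 8 | My | O | My | PRP$ | My | my |
| 9 | life | O | life | NN | life | life |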


### Using fullAnnotate to get more details

In [0]:
detailed_result = pipeline_dl.fullAnnotate(testDoc)

detailed_result[0]['entities']

Out[7]: [Annotation(chunk, 1, 12, Peter Parker, {'entity': 'PER', 'sentence': '0', 'chunk': '0'}, []),
 Annotation(chunk, 47, 52, Russia, {'entity': 'LOC', 'sentence': '1', 'chunk': '1'}, []),
 Annotation(chunk, 74, 77, John, {'entity': 'PER', 'sentence': '2', 'chunk': '2'}, []),
 Annotation(chunk, 83, 87, Peter, {'entity': 'PER', 'sentence': '2', 'chunk': '3'}, []),
 Annotation(chunk, 151, 163, Mercedes Benz, {'entity': 'ORG', 'sentence': '4', 'chunk': '4'}, []),
 Annotation(chunk, 202, 207, Europe, {'entity': 'LOC', 'sentence': '5', 'chunk': '5'}, [])]
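
The `begin` and `end` values in each `Annotation` above appear to be inclusive character offsets into the input string (which starts with a newline at index 0). The following is a minimal plain-Python sketch of that reading, using a shortened stand-in for `testDoc`; the helper name `slice_inclusive` is hypothetical and not part of Spark NLP.

```python
# Stand-in for the first two lines of testDoc; the leading newline sits at index 0.
text = "\nPeter Parker is a very good persn.\nMy life in Russia is very intersting.\n"

# (begin, end, expected chunk) copied from the Annotation output above.
spans = [(1, 12, "Peter Parker"), (47, 52, "Russia")]


def slice_inclusive(source, begin, end):
    """Return the substring from begin to end, treating both offsets as inclusive."""
    return source[begin:end + 1]


recovered = [slice_inclusive(text, begin, end) for begin, end, _ in spans]
print(recovered)
```

Because `end` is inclusive, a Python slice needs `end + 1`; forgetting this drops the last character of every chunk.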

In [0]:
chunks = []
entities = []
for n in detailed_result[0]['entities']:
  chunks.append(n.result)
  entities.append(n.metadata['entity'])

df = pd.DataFrame({'chunks':chunks, 'entities':entities})
df

| | chunks | entities |
|---|---|---|
| 0 | Peter Parker | PER |
| 1 | Russia | LOC |
| 2 | John | PER |
| 3 | Peter | PER |
| 4 | Mercedes Benz | ORG |
| 5 | Europe | LOC |


In [0]:
tuples = []

for x,y,z in zip(detailed_result[0]["token"], detailed_result[0]["pos"], detailed_result[0]["ner"]):

  tuples.append((int(x.metadata['sentence']), x.result, x.begin, x.end, y.result, z.result))

df = pd.DataFrame(tuples, columns=['sent_id','token','start','end','pos', 'ner'])

df


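| | sent_id | token | start | end | pos | ner |
|---|---|---|---|---|---|---|
| 0 | 0 | Peter | 1 | 5 | NNP | B-PER |
| 1 | 0 | Parker | 7 | 12 | NNP | I-PER |
| 2 | 0 | is | 14 | 15 | VBZ | O |
| 3 | 0 | a | 17 | 17 | DT | O |
| 4 | 0 | very | 19 | 22 | RB | O |
| 5 | 0 | good | 24 | 27 | JJ | O |
| 6 | 0 | persn | 29 | 33 | NN | O |
| 7 | 0 | . | 34 | 34 | . | O |
| 8 | 1 | My | 36 | 37 | PRP$ | O |
| 9 | 1 | life | 39 | 42 | NN | O |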
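
The `ner` column uses IOB-style tags: `B-PER` opens a person chunk, `I-PER` continues it, and `O` marks tokens outside any entity. The `entities` chunks shown earlier can be thought of as consecutive `B-`/`I-` tokens merged together. Below is a minimal plain-Python sketch of that merging idea. It is an illustration only, not the library's own converter: the function name `iob_to_chunks` is hypothetical, and joining tokens with a single space is a simplifying assumption.

```python
def iob_to_chunks(tokens, tags):
    """Merge IOB-tagged tokens into (chunk_text, label) pairs."""
    chunks = []
    current_tokens, current_label = [], None

    def flush():
        # Store the chunk collected so far, if any.
        if current_tokens:
            chunks.append((" ".join(current_tokens), current_label))

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-") or (tag.startswith("I-") and tag[2:] != current_label):
            # A B- tag (or an I- tag with a different label) starts a new chunk.
            flush()
            current_tokens, current_label = [token], tag[2:]
        elif tag.startswith("I-"):
            # Same label as the open chunk: extend it.
            current_tokens.append(token)
        else:
            # "O" closes any open chunk.
            flush()
            current_tokens, current_label = [], None
    flush()
    return chunks


tokens = ["Peter", "Parker", "is", "a", "very", "good", "persn", "."]
tags = ["B-PER", "I-PER", "O", "O", "O", "O", "O", "O"]
chunks = iob_to_chunks(tokens, tags)
print(chunks)
```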


### Sentiment Analysis

In [0]:
sentiment = PretrainedPipeline('analyze_sentiment', lang='en')

analyze_sentiment download started this may take some time.
Approx size to download 4.9 MB
[ | ][OK!]


In [0]:
result = sentiment.annotate("The movie I watched today was not a good one")

result['sentiment']

Out[11]: ['negative']

In [0]:
sentiment_imdb_glove = PretrainedPipeline('analyze_sentimentdl_glove_imdb', lang='en')

analyze_sentimentdl_glove_imdb download started this may take some time.
Approx size to download 155.3 MB
[ | ][OK!]


In [0]:
comment = '''
It's a very scary film but what impressed me was how true the film sticks to the original's tricks; it isn't filled with loud in-your-face jump scares, in fact, a lot of what makes this film scary is the slick cinematography and intricate shadow play. The use of lighting and creation of atmosphere is what makes this film so tense, which is why it's perfectly suited for those who like Horror movies but without the obnoxious gore.
'''
result = sentiment_imdb_glove.annotate(comment)

result['sentiment']

Out[13]: ['pos']

## Using the modules in a pipeline for custom tasks

For a more detailed notebook, see https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/2.Text_Preprocessing_with_SparkNLP_Annotators_Transformers.ipynb

In [0]:
!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/examples/python/annotation/text/english/spark-nlp-basics/sample-sentences-en.txt
  
dbutils.fs.cp("file:/databricks/driver/sample-sentences-en.txt", "dbfs:/") 

Out[14]: True

In [0]:
with open('sample-sentences-en.txt') as f:
  print (f.read())

Peter is a very good person.
My life in Russia is very interesting.
John and Peter are brothers. However they don't support each other that much.
Lucas Nogal Dunbercker is no longer happy. He has a good car though.
Europe is very culture rich. There are huge churches! and big houses!


In [0]:
spark_df = spark.read.text('/sample-sentences-en.txt').toDF('text')

spark_df.show(truncate=False)

+-----------------------------------------------------------------------------+
|text                                                                         |
+-----------------------------------------------------------------------------+
|Peter is a very good person.                                                 |
|My life in Russia is very interesting.                                       |
|John and Peter are brothers. However they don't support each other that much.|
|Lucas Nogal Dunbercker is no longer happy. He has a good car though.         |
|Europe is very culture rich. There are huge churches! and big houses!        |
+-----------------------------------------------------------------------------+



In [0]:
textFiles = spark.sparkContext.wholeTextFiles("/sample-sentences-en.txt",4) # or a glob such as "/*.txt" to read every text file in a folder
    
spark_df_folder = textFiles.toDF(schema=['path','text'])

spark_df_folder.show(truncate=30)

+-----------------------------+------------------------------+
|                         path|                          text|
+-----------------------------+------------------------------+
|dbfs:/sample-sentences-en.txt|Peter is a very good person...|
+-----------------------------+------------------------------+



In [0]:
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = SentenceDetector()\
    .setInputCols(['document'])\
    .setOutputCol('sentences')

tokenizer = Tokenizer() \
    .setInputCols(["sentences"]) \
    .setOutputCol("token")

nlpPipeline = Pipeline(stages=[
     documentAssembler, 
     sentenceDetector,
     tokenizer
 ])

empty_df = spark.createDataFrame([['']]).toDF("text")

pipelineModel = nlpPipeline.fit(empty_df)

In [0]:
result = pipelineModel.transform(spark_df)


In [0]:
result.show(truncate=20)


+--------------------+--------------------+--------------------+--------------------+
|                text|            document|           sentences|               token|
+--------------------+--------------------+--------------------+--------------------+
|Peter is a very g...|[{document, 0, 27...|[{document, 0, 27...|[{token, 0, 4, Pe...|
|My life in Russia...|[{document, 0, 37...|[{document, 0, 37...|[{token, 0, 1, My...|
|John and Peter ar...|[{document, 0, 76...|[{document, 0, 27...|[{token, 0, 3, Jo...|
|Lucas Nogal Dunbe...|[{document, 0, 67...|[{document, 0, 41...|[{token, 0, 4, Lu...|
|Europe is very cu...|[{document, 0, 68...|[{document, 0, 27...|[{token, 0, 5, Eu...|
+--------------------+--------------------+--------------------+--------------------+



In [0]:
result.printSchema()


root
 |-- text: string (nullable = true)
 |-- document: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)
 |-- sentences: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = tru

In [0]:
result.select('sentences.result').take(3)


Out[22]: [Row(result=['Peter is a very good person.']),
 Row(result=['My life in Russia is very interesting.']),
 Row(result=['John and Peter are brothers.', "However they don't support each other that much."])]

### StopWords Cleaner

This annotator takes a sequence of strings (e.g. the output of a Tokenizer, Normalizer, Lemmatizer, or Stemmer) and drops all the stop words from it.

**Functions**:



- **setStopWords**      : The words to be filtered out. Array[String]

- **setCaseSensitive**   : Whether to do a case sensitive comparison over the stop words.
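
To make the effect of these two settings concrete, here is a minimal plain-Python sketch of stop-word filtering. It only illustrates the idea and is not how the annotator is implemented; the function name `remove_stop_words` and the three-word stop list are hypothetical.

```python
def remove_stop_words(tokens, stop_words, case_sensitive=False):
    """Return the tokens that are not in stop_words."""
    if case_sensitive:
        blocked = set(stop_words)
        return [t for t in tokens if t not in blocked]
    # Case-insensitive: compare lower-cased forms on both sides.
    blocked = {w.lower() for w in stop_words}
    return [t for t in tokens if t.lower() not in blocked]


tokens = ["Peter", "is", "a", "very", "good", "person", "."]
stop_words = ["is", "a", "very"]  # hypothetical short stop list

cleaned = remove_stop_words(tokens, stop_words)
print(cleaned)

# Case sensitivity decides whether "My" is treated as the stop word "my".
insensitive = remove_stop_words(["My", "life"], ["my"], case_sensitive=False)
sensitive = remove_stop_words(["My", "life"], ["my"], case_sensitive=True)
print(insensitive, sensitive)
```

With case-insensitive matching, capitalised sentence-initial words such as `My` are removed along with their lower-case forms, which is usually what you want for English text.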

In [0]:
stopwords_cleaner = StopWordsCleaner()\
      .setInputCols("token")\
      .setOutputCol("cleanTokens")\
      .setCaseSensitive(False)

In [0]:
stopwords_cleaner.getStopWords()[:10]

Out[24]: ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your']

In [0]:
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

nlpPipeline = Pipeline(stages=[
     documentAssembler, 
     tokenizer,
     stopwords_cleaner
 ])

empty_df = spark.createDataFrame([['']]).toDF("text")

pipelineModel = nlpPipeline.fit(empty_df)

result = pipelineModel.transform(spark_df)

result.show()

+--------------------+--------------------+--------------------+--------------------+
|                text|            document|               token|         cleanTokens|
+--------------------+--------------------+--------------------+--------------------+
|Peter is a very g...|[{document, 0, 27...|[{token, 0, 4, Pe...|[{token, 0, 4, Pe...|
|My life in Russia...|[{document, 0, 37...|[{token, 0, 1, My...|[{token, 3, 6, li...|
|John and Peter ar...|[{document, 0, 76...|[{token, 0, 3, Jo...|[{token, 0, 3, Jo...|
|Lucas Nogal Dunbe...|[{document, 0, 67...|[{token, 0, 4, Lu...|[{token, 0, 4, Lu...|
|Europe is very cu...|[{document, 0, 68...|[{token, 0, 5, Eu...|[{token, 0, 5, Eu...|
+--------------------+--------------------+--------------------+--------------------+



In [0]:
result.select('cleanTokens.result').take(1)


Out[26]: [Row(result=['Peter', 'good', 'person', '.'])]

### Text Matcher

Annotator that matches entire phrases (token by token), provided in a file, against a document.

Functions:

`setEntities(path, readAs, options)`: Provides a file with phrases to match. Default: Looks up path in configuration.

`path`: a path to a file that contains the entities in the specified format.

`readAs`: the format of the file, can be one of {ReadAs.LINE_BY_LINE, ReadAs.SPARK_DATASET}. Defaults to LINE_BY_LINE.

`options`: a map of additional parameters. Defaults to `{"format": "text"}`.

`entityValue` : Value for the entity metadata field to indicate which chunk comes from which textMatcher when there are multiple textMatchers. 

`mergeOverlapping` : whether to merge overlapping matched chunks. Defaults to false.

`caseSensitive` : whether matching is case sensitive. Defaults to true.
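
As a rough illustration of what "matching entire phrases by token" means, and of how `caseSensitive` changes the outcome, here is a minimal plain-Python sketch. It is not the annotator's implementation: the function name `match_phrases` is hypothetical, the sample sentence is made up, and splitting on whitespace stands in for a real tokenizer.

```python
def match_phrases(tokens, phrases, case_sensitive=True):
    """Return every phrase occurrence found as a contiguous run of tokens."""
    normalize = (lambda s: s) if case_sensitive else str.lower
    # Each phrase becomes a list of normalized words (whitespace split is an assumption).
    phrase_words = [[normalize(w) for w in p.split()] for p in phrases]
    normalized = [normalize(t) for t in tokens]

    matches = []
    for start in range(len(tokens)):
        for words in phrase_words:
            end = start + len(words)
            if words and normalized[start:end] == words:
                # Report the text as it appears in the document, not in the phrase file.
                matches.append(" ".join(tokens[start:end]))
    return matches


tokens = "Stock prices fell on Wall Street today".split()
phrases = ["Wall Street", "USD", "stock", "NYSE"]

loose = match_phrases(tokens, phrases, case_sensitive=False)
strict = match_phrases(tokens, phrases, case_sensitive=True)
print(loose)
print(strict)
```

A phrase only matches when all of its tokens appear consecutively, so `Wall` on its own would not produce a hit for `Wall Street`.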

In [0]:
! wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Public/data/news_category_train.csv
  
dbutils.fs.cp("file:/databricks/driver/news_category_train.csv", "dbfs:/")

Out[27]: True

In [0]:
news_df = spark.read \
      .option("header", True) \
      .csv("/news_category_train.csv")


news_df.show(5, truncate=50)

+--------+--------------------------------------------------+
|category|                                       description|
+--------+--------------------------------------------------+
|Business| Short sellers, Wall Street's dwindling band of...|
|Business| Private investment firm Carlyle Group, which h...|
|Business| Soaring crude prices plus worries about the ec...|
|Business| Authorities have halted oil export flows from ...|
|Business| Tearaway world oil prices, toppling records an...|
+--------+--------------------------------------------------+
only showing top 5 rows



In [0]:
entities = ['Wall Street', 'USD', 'stock', 'NYSE']
with open ('financial_entities.txt', 'w') as f:
    for i in entities:
        f.write(i+'\n')


entities = ['soccer', 'world cup', 'Messi', 'FC Barcelona']
with open ('sport_entities.txt', 'w') as f:
    for i in entities:
        f.write(i+'\n')


In [0]:
dbutils.fs.cp("file:/databricks/driver/financial_entities.txt", "dbfs:/")
dbutils.fs.cp("file:/databricks/driver/sport_entities.txt", "dbfs:/")

Out[30]: True

In [0]:
documentAssembler = DocumentAssembler()\
    .setInputCol("description")\
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

financial_entity_extractor = TextMatcher() \
    .setInputCols(["document",'token'])\
    .setOutputCol("financial_entities")\
    .setEntities("file:/databricks/driver/financial_entities.txt")\
    .setCaseSensitive(False)\
    .setEntityValue('financial_entity')

sport_entity_extractor = TextMatcher() \
    .setInputCols(["document",'token'])\
    .setOutputCol("sport_entities")\
    .setEntities("file:/databricks/driver/sport_entities.txt")\
    .setCaseSensitive(False)\
    .setEntityValue('sport_entity')


nlpPipeline = Pipeline(stages=[
     documentAssembler, 
     tokenizer,
     financial_entity_extractor,
     sport_entity_extractor
 ])

empty_df = spark.createDataFrame([['']]).toDF("description")

pipelineModel = nlpPipeline.fit(empty_df)

In [0]:
result = pipelineModel.transform(news_df)


In [0]:
result.select('financial_entities.result','sport_entities.result').take(2)


Out[33]: [Row(result=[], result=[]), Row(result=[], result=[])]

This means there are no financial or sport entities in the first two rows.

In [0]:
from pyspark.sql import functions as F

result.select('description','financial_entities.result','sport_entities.result')\
      .toDF('text','financial_matches','sport_matches')\
      .filter((F.size('financial_matches')>1) | (F.size('sport_matches')>1))\
      .show(truncate=70)


+----------------------------------------------------------------------+----------------------------------+-------------------+
|                                                                  text|                 financial_matches|      sport_matches|
+----------------------------------------------------------------------+----------------------------------+-------------------+
|"Company launched the biggest electronic auction of stock in Wall S...|              [stock, Wall Street]|                 []|
|Google, Inc. significantly cut the expected share price for its ini...|                    [stock, stock]|                 []|
|Google, Inc. significantly cut the expected share price this mornin...|                    [stock, stock]|                 []|
| Shares of Air Canada  (AC.TO) fell by more than half on Wednesday,...|                    [Stock, stock]|                 []|
|Stock prices are lower in moderate trading. The Dow Jones Industria...|                    [Stock, Stoc

### Using the pipeline in a LightPipeline

In [0]:
light_model = LightPipeline(pipelineModel)

light_result = light_model.fullAnnotate("Google, Inc. significantly cut the expected share price for its stock at Wall Street")

light_result[0]['financial_entities']

Out[35]: [Annotation(chunk, 64, 68, stock, {'entity': 'financial_entity', 'sentence': '0', 'chunk': '0'}, []),
 Annotation(chunk, 73, 83, Wall Street, {'entity': 'financial_entity', 'sentence': '0', 'chunk': '1'}, [])]

## Pretrained Models

Spark NLP offers pre-trained models in **200+ languages**. All you need to do is load a pre-trained model to disk by specifying its name, then configure the model parameters for your use case and dataset. You do not need to train a new model from scratch, and can apply the pre-trained SOTA algorithms directly to your own data with transform().

The official documentation gives detailed information on how these models were trained, including the algorithms and datasets used.

https://github.com/JohnSnowLabs/spark-nlp-models

For a more detailed notebook, see https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/3.SparkNLP_Pretrained_Models.ipynb

### LemmatizerModel and ContextSpellCheckerModel

In [0]:
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

spellModel = ContextSpellCheckerModel\
    .pretrained('spellcheck_dl')\
    .setInputCols("token")\
    .setOutputCol("checked")

lemmatizer = LemmatizerModel.pretrained('lemma_antbnc', 'en') \
    .setInputCols(["checked"]) \
    .setOutputCol("lemma")

pipeline = Pipeline(stages = [
    documentAssembler,
    tokenizer,
    spellModel,
    lemmatizer
  ])

empty_ds = spark.createDataFrame([[""]]).toDF("text")

sc_model = pipeline.fit(empty_ds)

lp = LightPipeline(sc_model)

spellcheck_dl download started this may take some time.
Approximate size to download 95.1 MB
[ | ][OK!]
lemma_antbnc download started this may take some time.
Approximate size to download 907.6 KB
[ | ][OK!]


In [0]:
result = lp.annotate("Plaese alliow me tao introdduce myhelf, I am a man of waelth und tiaste and he just knows that")

list(zip(result['token'],result['checked'],result['lemma']))

Out[37]: [('Plaese', 'Please', 'Please'),
 ('alliow', 'allow', 'allow'),
 ('me', 'me', 'i'),
 ('tao', 'to', 'to'),
 ('introdduce', 'introduce', 'introduce'),
 ('myhelf', 'myself', 'myself'),
 (',', ',', ','),
 ('I', 'I', 'I'),
 ('am', 'am', 'be'),
 ('a', 'a', 'a'),
 ('man', 'man', 'man'),
 ('of', 'of', 'of'),
 ('waelth', 'wealth', 'wealth'),
 ('und', 'and', 'and'),
 ('tiaste', 'taste', 'taste'),
 ('and', 'and', 'and'),
 ('he', 'he', 'he'),
 ('just', 'just', 'just'),
 ('knows', 'knows', 'know'),
 ('that', 'that', 'that')]

### Word and Sentence Embeddings

#### Word Embeddings

In [0]:
glove_embeddings = WordEmbeddingsModel.pretrained('glove_100d')\
      .setInputCols(["document", "token"])\
      .setOutputCol("embeddings")

glove_100d download started this may take some time.
Approximate size to download 145.3 MB
[ | ][OK!]


In [0]:
documentAssembler = DocumentAssembler()\
    .setInputCol("description")\
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

nlpPipeline = Pipeline(stages=[
     documentAssembler, 
     tokenizer,
     glove_embeddings
 ])

empty_df = spark.createDataFrame([['']]).toDF("description")
pipelineModel = nlpPipeline.fit(empty_df)

result = pipelineModel.transform(news_df.limit(1))

output = result.select('token.result','embeddings.embeddings').limit(1).rdd.flatMap(lambda x: x).collect()


In [0]:
pd.DataFrame({'token':output[0],'embeddings':output[1]})

| | token | embeddings |
|---|---|---|
| 0 | Short | [-0.4308899939060211, -0.023907000198960304, -... |
| 1 | sellers | [0.1458200067281723, 0.2753300070762634, -0.20... |
| 2 | , | [-0.10767000168561935, 0.11052999645471573, 0.... |
| 3 | Wall | [0.21383999288082123, 0.22098000347614288, 0.0... |
| 4 | Street's | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... |
| 5 | dwindling | [-0.5611299872398376, 1.1217999458312988, 0.65... |
| 6 | band | [-0.12160000205039978, -0.24347999691963196, 0... |
| 7 | of | [-0.15289999544620514, -0.24278999865055084, 0... |
| 8 | ultra | [-0.3504500091075897, -0.27733999490737915, 0.... |
| 9 | cynics | [-0.06557200103998184, 0.45271000266075134, 0.... |
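
Note the all-zero vector for `Street's` in the output above. A plausible reading, which is an assumption rather than something this notebook states, is that the token was not found in the GloVe vocabulary and received a zero vector as a fallback. The plain-Python sketch below shows one way to flag such tokens once the results are collected; the helper name `zero_vector_tokens` is hypothetical, and the vectors are cut to three dimensions for brevity.

```python
def zero_vector_tokens(tokens, vectors):
    """Return the tokens whose embedding vector contains only zeros."""
    return [token for token, vector in zip(tokens, vectors) if not any(vector)]


# Values taken from the output above, truncated to three dimensions for illustration.
tokens = ["Wall", "Street's"]
vectors = [
    [0.21384, 0.22098, 0.037105],
    [0.0, 0.0, 0.0],
]

missing = zero_vector_tokens(tokens, vectors)
print(missing)
```

The same helper could be applied to the `output[0]` and `output[1]` lists collected earlier, for example to decide whether extra normalization (such as splitting off possessives) is worth adding before the embeddings stage.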


In [0]:
result = pipelineModel.transform(news_df.limit(10))

result_df = result.select(F.explode(F.arrays_zip(result.token.result, result.embeddings.embeddings)).alias("cols")) \
                  .select(F.expr("cols['0']").alias("token"),
                          F.expr("cols['1']").alias("embeddings"))

result_df.show(10, truncate=100)

+---------+----------------------------------------------------------------------------------------------------+
|    token|                                                                                          embeddings|
+---------+----------------------------------------------------------------------------------------------------+
|    Short|[-0.43089, -0.023907, -0.081875, 0.044522, 0.33741, -0.23081, -0.35145, 0.33043, -0.92222, -0.220...|
|  sellers|[0.14582, 0.27533, -0.20703, -0.30671, 0.54408, -0.18303, -0.38876, -0.52166, 0.3569, -1.085, 0.1...|
|        ,|[-0.10767, 0.11053, 0.59812, -0.54361, 0.67396, 0.10663, 0.038867, 0.35481, 0.06351, -0.094189, 0...|
|     Wall|[0.21384, 0.22098, 0.037105, -0.29186, -0.030131, -0.16247, -1.1043, -0.88436, -0.078059, -0.6353...|
| Street's|[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0...|
|dwindling|[-0.56113, 1.1218, 0.65823, 0.2699, 0.12404, -0.12759, -1.0287, -0.64777, 0.59677, -0

#### Bert Embeddings

In [0]:
bert_embeddings = BertEmbeddings.pretrained('bert_base_cased')\
      .setInputCols(["document", "token"])\
      .setOutputCol("embeddings")

bert_base_cased download started this may take some time.
Approximate size to download 389.1 MB
[ | ][OK!]


In [0]:
documentAssembler = DocumentAssembler()\
    .setInputCol("description")\
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

 
nlpPipeline = Pipeline(stages=[
     documentAssembler, 
     tokenizer,
     bert_embeddings
 ])

empty_df = spark.createDataFrame([['']]).toDF("description")

pipelineModel = nlpPipeline.fit(empty_df)

result = pipelineModel.transform(news_df.limit(10))

result_df = result.select(F.explode(F.arrays_zip(result.token.result, result.embeddings.embeddings)).alias("cols")) \
                  .select(F.expr("cols['0']").alias("token"),
                          F.expr("cols['1']").alias("bert_embeddings"))

result_df.show(truncate=100)

+----------+----------------------------------------------------------------------------------------------------+
|     token|                                                                                     bert_embeddings|
+----------+----------------------------------------------------------------------------------------------------+
|     Short|[0.49818483, -0.3331852, 0.45259982, -0.29747552, -0.37543702, 0.31528342, 0.27503866, -0.0785038...|
|   sellers|[0.30120668, 0.60389084, 0.043556854, -0.083134115, 0.15532492, 0.19738051, 0.32395983, 0.2538229...|
|         ,|[-0.06041675, 0.16640486, -0.13490139, 0.14198671, -0.022981357, 0.021920532, 0.54551727, 0.43710...|
|      Wall|[-0.13157815, 0.43838108, -0.38766366, -0.33669645, 0.14921406, 0.030721106, 0.1554359, -0.070619...|
|  Street's|[0.35614696, 0.41715145, -0.022471193, -0.18717085, 0.45731312, 0.2509821, 0.20012417, 0.35113996...|
| dwindling|[0.5994897, -0.11001092, -0.19151862, -0.41372263, 0.40886638, -0.4010115, 0

#### Bert Sentence Embeddings

In [0]:
bert_sentence_embeddings = BertSentenceEmbeddings.pretrained('sent_small_bert_L6_128')\
    .setInputCols(["document"])\
    .setOutputCol("bert_sent_embeddings")


nlpPipeline = Pipeline(stages=[
     documentAssembler, 
     bert_sentence_embeddings
 ])


empty_df = spark.createDataFrame([['']]).toDF("description")

pipelineModel = nlpPipeline.fit(empty_df)

result = pipelineModel.transform(news_df.limit(10))

result_df = result.select(F.explode(F.arrays_zip(result.document.result, result.bert_sent_embeddings.embeddings)).alias("cols"))\
                  .select(F.expr("cols['0']").alias("document"),
                          F.expr("cols['1']").alias("bert_sent_embeddings"))

result_df.show(truncate=100)

sent_small_bert_L6_128 download started this may take some time.
Approximate size to download 19 MB
[ | ][OK!]
+----------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------+
|                                                                                            document|                                                                                bert_sent_embeddings|
+----------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------+
|                Short sellers, Wall Street's dwindling band of ultra cynics, are seeing green again.|[-0.41675937, 0.5293154, -0.49145287, 0.37030974, -1.2797567, -1.4873856, 0.77664876, 0.6182901, ...|
| Private investment firm Carlyle Group, which has a reputation for mak

#### Universal Sentence Encoder

In [0]:
# no need for token columns 
use_embeddings = UniversalSentenceEncoder.pretrained('tfhub_use')\
    .setInputCols(["document"])\
    .setOutputCol("sentence_embeddings")

tfhub_use download started this may take some time.
Approximate size to download 923.7 MB
[ | ][OK!]


In [0]:
from pyspark.sql import functions as F

documentAssembler = DocumentAssembler()\
    .setInputCol("description")\
    .setOutputCol("document")

nlpPipeline = Pipeline(stages=[
     documentAssembler, 
     use_embeddings
   ])

empty_df = spark.createDataFrame([['']]).toDF("description")

pipelineModel = nlpPipeline.fit(empty_df)

result = pipelineModel.transform(news_df.limit(10))

result_df = result.select(F.explode(F.arrays_zip(result.document.result, result.sentence_embeddings.embeddings)).alias("cols"))\
                  .select(F.expr("cols['0']").alias("document"),
                          F.expr("cols['1']").alias("USE_embeddings"))

result_df.show(truncate=100)

+----------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------+
|                                                                                            document|                                                                                      USE_embeddings|
+----------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------+
|                Short sellers, Wall Street's dwindling band of ultra cynics, are seeing green again.|[0.04415017, -6.5326475E-4, -0.013665826, -0.060485892, -0.07109088, 0.048674487, 0.08480666, -0....|
| Private investment firm Carlyle Group, which has a reputation for making well timed and occasion...|[0.08444513, 0.03535444, -0.0398393, 0.021572154, -0.09528031, 0.06693117, 0.08432

### Named Entity Recognition (NER) Models

For a detailed notebook, see https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/4.NERDL_Training.ipynb

In [0]:
documentAssembler = DocumentAssembler()\
    .setInputCol("description")\
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

glove_embeddings = WordEmbeddingsModel.pretrained('glove_100d')\
      .setInputCols(["document", "token"])\
      .setOutputCol("embeddings")

onto_ner = NerDLModel.pretrained("onto_100", 'en') \
      .setInputCols(["document", "token", "embeddings"]) \
      .setOutputCol("ner")

ner_converter = NerConverter() \
      .setInputCols(["document", "token", "ner"]) \
      .setOutputCol("ner_chunk")


nlpPipeline = Pipeline(stages=[
       documentAssembler, 
       tokenizer,
       glove_embeddings,
       onto_ner,
       ner_converter
 ])

empty_df = spark.createDataFrame([['']]).toDF("description")

pipelineModel = nlpPipeline.fit(empty_df)

glove_100d download started this may take some time.
Approximate size to download 145.3 MB
[ | ][OK!]
onto_100 download started this may take some time.
Approximate size to download 13.5 MB
[ | ][OK!]


In [0]:
result = pipelineModel.transform(news_df.limit(10))

result.select(F.explode(F.arrays_zip(result.ner_chunk.result, result.ner_chunk.metadata)).alias("cols")) \
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']['entity']").alias("ner_label")).show(truncate=False)

+--------------------------------+---------+
|chunk                           |ner_label|
+--------------------------------+---------+
|Carlyle Group                   |ORG      |
|next week                       |DATE     |
|summer                          |DATE     |
|Iraq                            |GPE      |
|Saturday                        |DATE     |
|three months                    |DATE     |
|US                              |GPE      |
|Friday                          |DATE     |
|the year                        |DATE     |
|#36;46                          |CARDINAL |
|Dell Inc                        |ORG      |
|#36;1.17 billion                |CARDINAL |
|the latest week                 |DATE     |
|#36;849.98 trillion             |CARDINAL |
|the Investment Company Institute|ORG      |
|Thursday                        |DATE     |
|July                            |DATE     |
|last week                       |DATE     |
|Thursday                        |DATE     |
|midsummer

In [0]:
light_model = LightPipeline(pipelineModel)

light_result = light_model.fullAnnotate('Peter Parker is a nice person and lives in New York. Bruce Wayne is also a nice guy and lives in Gotham City.')


chunks = []
entities = []

# collect the text and the entity label of every detected chunk
for n in light_result[0]['ner_chunk']:
    chunks.append(n.result)
    entities.append(n.metadata['entity'])

import pandas as pd

df = pd.DataFrame({'chunks':chunks, 'entities':entities})

df

         chunks  entities
0  Peter Parker    PERSON
1      New York       GPE
2   Bruce Wayne    PERSON
3   Gotham City       GPE


#### Train a NER model

**To train a new NER from scratch, check out**

https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/4.NERDL_Training.ipynb

In [0]:
!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/src/test/resources/conll2003/eng.train

# optional: copy the downloaded file to DBFS
#dbutils.fs.cp("file:/databricks/driver/eng.train", "dbfs:/")

In [0]:
from sparknlp.training import CoNLL

training_data = CoNLL().readDataset(spark, 'file:/databricks/driver/eng.train')

training_data.show(3)

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|               token|                 pos|               label|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|EU rejects German...|[{document, 0, 47...|[{document, 0, 47...|[{token, 0, 1, EU...|[{pos, 0, 1, NNP,...|[{named_entity, 0...|
|     Peter Blackburn|[{document, 0, 14...|[{document, 0, 14...|[{token, 0, 4, Pe...|[{pos, 0, 4, NNP,...|[{named_entity, 0...|
| BRUSSELS 1996-08-22|[{document, 0, 18...|[{document, 0, 18...|[{token, 0, 7, BR...|[{pos, 0, 7, NNP,...|[{named_entity, 0...|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
only showing top 3 rows



In [0]:
training_data.select(F.explode(F.arrays_zip(training_data.token.result, training_data.label.result)).alias("cols"))\
              .select(F.expr("cols['0']").alias("token"),
                      F.expr("cols['1']").alias("ground_truth"))\
              .groupBy('ground_truth').count().orderBy('count', ascending=False).show(100,truncate=False)

+------------+------+
|ground_truth|count |
+------------+------+
|O           |169578|
|B-LOC       |7140  |
|B-PER       |6600  |
|B-ORG       |6321  |
|I-PER       |4528  |
|I-ORG       |3704  |
|B-MISC      |3438  |
|I-LOC       |1157  |
|I-MISC      |1155  |
+------------+------+
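
The counts above show how strongly the `O` tag dominates the training data. A minimal sketch in plain Python (no Spark session needed) that turns the counts copied from the table into shares; the variable names are illustrative only.

```python
# Token counts per label, copied from the table above.
label_counts = {
    "O": 169578, "B-LOC": 7140, "B-PER": 6600, "B-ORG": 6321,
    "I-PER": 4528, "I-ORG": 3704, "B-MISC": 3438, "I-LOC": 1157, "I-MISC": 1155,
}

total_tokens = sum(label_counts.values())
label_share = {label: count / total_tokens for label, count in label_counts.items()}

# Print labels from most to least frequent with their share of all tokens.
for label, share in sorted(label_share.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{label:<7}{share:6.1%}")
```

With more than 80% of the tokens tagged `O`, plain token accuracy would look high even for a weak model, which is one reason to read the per-label precision and recall in the training logs.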



In [0]:
# You can use any word embeddings you want (Glove, Elmo, Bert, custom etc.)

glove_embeddings = WordEmbeddingsModel.pretrained('glove_100d')\
          .setInputCols(["document", "token"])\
          .setOutputCol("embeddings")

glove_100d download started this may take some time.
Approximate size to download 145.3 MB
[ | ][OK!]


In [0]:
%fs mkdirs dbfs:/ner_logs

In [0]:
nerTagger = NerDLApproach()\
  .setInputCols(["sentence", "token", "embeddings"])\
  .setLabelColumn("label")\
  .setOutputCol("ner")\
  .setMaxEpochs(2)\
  .setLr(0.003)\
  .setPo(0.05)\
  .setBatchSize(32)\
  .setRandomSeed(0)\
  .setVerbose(1)\
  .setValidationSplit(0.2)\
  .setEvaluationLogExtended(True) \
  .setEnableOutputLogs(True)\
  .setIncludeConfidence(True)\
  .setOutputLogsPath('dbfs:/ner_logs') # if not set, logs will be written to ~/annotator_logs
 #.setGraphFolder('graphs')  # put your graph file (.pb) in this folder if you are using a custom graph generated through NerDL-Graph
    
    
ner_pipeline = Pipeline(stages=[
          glove_embeddings,
          nerTagger
 ])
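
`setLr` and `setPo` together shape the learning-rate schedule. The sketch below assumes the decay rule given in the Spark NLP parameter documentation for `NerDLApproach` (effective rate = lr / (1 + po * epoch)) and assumes epochs are counted from 0; please confirm both against the version you run. `effective_lr` is a hypothetical helper, not a Spark NLP function.

```python
def effective_lr(lr: float, po: float, epoch: int) -> float:
    """Assumed decay rule: lr / (1 + po * epoch), with epoch starting at 0."""
    return lr / (1.0 + po * epoch)

# Values passed to setLr / setPo / setMaxEpochs in the cell above.
for epoch in range(2):
    print(f"epoch {epoch}: lr = {effective_lr(0.003, 0.05, epoch):.6f}")
```

Under this assumption a small `po` such as 0.05 barely changes the rate over two epochs; it only matters for longer training runs.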

In [0]:
# remove the existing logs

!rm -r /dbfs/ner_logs/*

In [0]:
ner_model = ner_pipeline.fit(training_data)

# 1 epoch takes around 2.5 min with batch size=32
# if you get an error for incompatible TF graph, use NERDL Graph script to generate the necessary TF graph at the end of this notebook

In [0]:
%sh cd /dbfs/ner_logs && pwd && ls -l

/dbfs/ner_logs
total 2
-rwxrwxrwx 1 root root 1832 Mar  2 10:40 NerDLApproach_99c7da711da7.log


In [0]:
%sh head -n 45 /dbfs/ner_logs/NerDLApproach_*

Name of the selected graph: ner-dl/blstm_10_100_128_120.pb
Training started - total epochs: 2 - lr: 0.003 - batch size: 32 - labels: 9 - chars: 83 - training examples: 11262


Epoch 1/2 started, lr: 0.003, dataset size: 11262


Epoch 1/2 - 41.82s - loss: 996.6914 - batches: 355
Quality on validation dataset (20.0%), validation examples = 2252
time to finish evaluation: 3.10s
label	 tp	 fp	 fn	 prec	 rec	 f1
B-LOC	 1278	 93	 135	 0.9321663	 0.9044586	 0.91810346
I-ORG	 625	 105	 149	 0.8561644	 0.80749357	 0.8311171
I-MISC	 151	 88	 66	 0.63179916	 0.6958525	 0.6622807
I-LOC	 184	 39	 47	 0.8251121	 0.7965368	 0.8105727
I-PER	 889	 33	 20	 0.96420825	 0.9779978	 0.9710541
B-MISC	 556	 86	 109	 0.8660436	 0.8360902	 0.8508033
B-ORG	 1100	 157	 187	 0.8750994	 0.85470086	 0.8647799
B-PER	 1256	 111	 40	 0.9188003	 0.9691358	 0.943297
tp: 6039 fp: 712 fn: 753 labels: 8
Macro-average	 prec: 0.85867417, rec: 0.85528326, f1: 0.8569753
Micro-average	 prec: 0.8945342, rec: 0.8891343, f1: 0.8918
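
The micro and macro averages in the log can be recomputed from the per-label `tp`/`fp`/`fn` counts. A small sketch in plain Python, using the numbers copied from the log above; `prf` is an illustrative helper, not part of Spark NLP.

```python
# (label, tp, fp, fn) rows copied from the validation log above.
rows = [
    ("B-LOC", 1278, 93, 135),
    ("I-ORG", 625, 105, 149),
    ("I-MISC", 151, 88, 66),
    ("I-LOC", 184, 39, 47),
    ("I-PER", 889, 33, 20),
    ("B-MISC", 556, 86, 109),
    ("B-ORG", 1100, 157, 187),
    ("B-PER", 1256, 111, 40),
]

def prf(tp, fp, fn):
    """Precision, recall and F1 from raw counts."""
    prec = tp / (tp + fp)
    rec = tp / (tp + fn)
    return prec, rec, 2 * prec * rec / (prec + rec)

# Micro-average: pool the counts over all labels, then compute once.
tp_sum = sum(r[1] for r in rows)
fp_sum = sum(r[2] for r in rows)
fn_sum = sum(r[3] for r in rows)
micro_prec, micro_rec, micro_f1 = prf(tp_sum, fp_sum, fn_sum)

# Macro-average: compute per label, then take the unweighted mean.
per_label = [prf(tp, fp, fn) for _, tp, fp, fn in rows]
macro_prec = sum(p for p, _, _ in per_label) / len(per_label)
macro_rec = sum(r for _, r, _ in per_label) / len(per_label)
# The logged macro F1 agrees with the harmonic mean of macro precision and recall.
macro_f1 = 2 * macro_prec * macro_rec / (macro_prec + macro_rec)

print(f"micro  prec={micro_prec:.4f} rec={micro_rec:.4f} f1={micro_f1:.4f}")
print(f"macro  prec={macro_prec:.4f} rec={macro_rec:.4f} f1={macro_f1:.4f}")
```

The macro average sits below the micro average here because the weakest label, `I-MISC`, counts as much as the frequent ones when every label is weighted equally.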

In [0]:
%fs mkdirs dbfs:/models

In [0]:
ner_model.stages[1].write().overwrite().save('dbfs:/models/NER_glove_e2_b32')

In [0]:
%sh cd /dbfs/models/ && pwd && ls -l

/dbfs/models
total 12
drwxrwxrwx 2 root root 4096 Mar  2 10:15 NER_glove_e2_b32
drwxrwxrwx 2 root root 4096 Mar  2 10:15 fields
drwxrwxrwx 2 root root 4096 Mar  2 10:15 metadata


#### Load saved model

In [0]:
document = DocumentAssembler()\
        .setInputCol("text")\
        .setOutputCol("document")

sentence = SentenceDetector()\
        .setInputCols(['document'])\
        .setOutputCol('sentence')

token = Tokenizer()\
        .setInputCols(['sentence'])\
        .setOutputCol('token')

glove_embeddings = WordEmbeddingsModel.pretrained('glove_100d')\
        .setInputCols(["document", "token"])\
        .setOutputCol("embeddings")
  
# load back and use in any pipeline
loaded_ner_model = NerDLModel.load("dbfs:/models/NER_glove_e2_b32")\
        .setInputCols(["sentence", "token", "embeddings"])\
        .setOutputCol("ner")

converter = NerConverter()\
        .setInputCols(["document", "token", "ner"])\
        .setOutputCol("ner_span")

ner_prediction_pipeline = Pipeline(stages = [
        document,
        sentence,
        token,
        glove_embeddings,
        loaded_ner_model,
        converter
])

glove_100d download started this may take some time.
Approximate size to download 145.3 MB
[ | ][OK!]


In [0]:
empty_data = spark.createDataFrame([['']]).toDF("text")

prediction_model = ner_prediction_pipeline.fit(empty_data)


In [0]:
text = "Peter Parker is a nice guy and lives in New York."

sample_data = spark.createDataFrame([[text]]).toDF("text")

sample_data.show(truncate=False)

+-------------------------------------------------+
|text                                             |
+-------------------------------------------------+
|Peter Parker is a nice guy and lives in New York.|
+-------------------------------------------------+



In [0]:
preds = prediction_model.transform(sample_data)

preds.select(F.explode(F.arrays_zip(preds.ner_span.result,preds.ner_span.metadata)).alias("entities")) \
      .select(F.expr("entities['0']").alias("chunk"),
              F.expr("entities['1'].entity").alias("entity")).show(truncate=False)

+------------+------+
|chunk       |entity|
+------------+------+
|Peter Parker|PER   |
|New York    |LOC   |
+------------+------+
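
`NerConverter` is the stage that turns token-level `B-`/`I-` tags into the chunks shown above. As a rough illustration of the idea only (this is not Spark NLP's implementation, and `iob_to_chunks` is a hypothetical helper), the grouping can be sketched in plain Python. The tags in the example are what one would expect for this sentence given the output above; they are not a logged model output.

```python
def iob_to_chunks(tokens, tags):
    """Group consecutive B-/I- tagged tokens into (chunk_text, entity) pairs."""
    chunks, current_tokens, current_entity = [], [], None
    for token, tag in zip(tokens, tags):
        prefix, _, entity = tag.partition("-")
        starts_new = prefix == "B" or (prefix == "I" and entity != current_entity)
        if prefix == "O" or starts_new:
            # Close the chunk that was being built, if any.
            if current_tokens:
                chunks.append((" ".join(current_tokens), current_entity))
            current_tokens, current_entity = [], None
        if prefix in ("B", "I"):
            current_tokens.append(token)
            current_entity = entity
    # Flush a chunk that runs to the end of the sentence.
    if current_tokens:
        chunks.append((" ".join(current_tokens), current_entity))
    return chunks

# Illustrative tags (assumed, not taken from a model run).
tokens = ["Peter", "Parker", "is", "a", "nice", "guy", "and",
          "lives", "in", "New", "York", "."]
tags = ["B-PER", "I-PER", "O", "O", "O", "O", "O",
        "O", "O", "B-LOC", "I-LOC", "O"]

print(iob_to_chunks(tokens, tags))
```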



### Text Classification

for a detailed notebook, see https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/5.Text_Classification_with_ClassifierDL.ipynb

In [0]:
! wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Public/data/news_category_test.csv
  
dbutils.fs.cp("file:/databricks/driver/news_category_test.csv", "dbfs:/") 

Out[66]: True

In [0]:
from pyspark.sql.functions import col

trainDataset = spark.read \
      .option("header", True) \
      .csv("/news_category_train.csv")

trainDataset.groupBy("category") \
    .count() \
    .orderBy(col("count").desc()) \
    .show()

+--------+-----+
|category|count|
+--------+-----+
|   World|30000|
|Sci/Tech|30000|
|  Sports|30000|
|Business|30000|
+--------+-----+



In [0]:
testDataset = spark.read \
      .option("header", True) \
      .csv("/news_category_test.csv")


testDataset.groupBy("category") \
    .count() \
    .orderBy(col("count").desc()) \
    .show()

+--------+-----+
|category|count|
+--------+-----+
|   World| 1900|
|Sci/Tech| 1900|
|  Sports| 1900|
|Business| 1900|
+--------+-----+



In [0]:
%fs mkdirs dbfs:/clf_dl_logs

In [0]:
# actual content is inside description column
document = DocumentAssembler()\
    .setInputCol("description")\
    .setOutputCol("document")
    
# we can also use a sentence detector here if we want to train on, and get predictions for, each sentence

use_embeddings = UniversalSentenceEncoder.pretrained('tfhub_use')\
    .setInputCols(["document"])\
    .setOutputCol("sentence_embeddings")

# the classes/labels/categories are in category column
classifierdl = ClassifierDLApproach()\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("class")\
    .setLabelColumn("category")\
    .setMaxEpochs(2)\
    .setBatchSize(8)\
    .setLr(0.001)\
    .setEnableOutputLogs(True)\
    .setOutputLogsPath('dbfs:/clf_dl_logs')

use_clf_pipeline = Pipeline(
    stages = [
        document,
        use_embeddings,
        classifierdl
    ])

tfhub_use download started this may take some time.
Approximate size to download 923.7 MB
[ | ][OK!]


In [0]:
# remove the existing logs

! rm -r /dbfs/clf_dl_logs/*

In [0]:
use_pipelineModel = use_clf_pipeline.fit(trainDataset)
# 5 epochs take around 3 min (this run uses setMaxEpochs(2))

In [0]:
%sh cd /dbfs/clf_dl_logs/ && ls -lt

total 1
-rwxrwxrwx 1 root root 251 Mar  2 10:43 ClassifierDLApproach_15a33392860b.log


In [0]:
%sh cat  /dbfs/clf_dl_logs/ClassifierDLApproach*


Training started - epochs: 2 - learning_rate: 0.001 - batch_size: 8 - training_examples: 120000 - classes: 4
Epoch 0/2 - 42.71s - loss: 12922.643 - acc: 0.88300836 - batches: 15000
Epoch 1/2 - 45.96s - loss: 12776.869 - acc: 0.891975 - batches: 15000
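
The training log is plain text, so per-epoch metrics can be pulled out with a regular expression. A minimal sketch, assuming the log keeps the line format shown above (the format may differ between Spark NLP versions); the log lines are pasted into a string so the example runs without DBFS access.

```python
import re

# Lines copied from the ClassifierDLApproach log above.
log_text = """\
Training started - epochs: 2 - learning_rate: 0.001 - batch_size: 8 - training_examples: 120000 - classes: 4
Epoch 0/2 - 42.71s - loss: 12922.643 - acc: 0.88300836 - batches: 15000
Epoch 1/2 - 45.96s - loss: 12776.869 - acc: 0.891975 - batches: 15000
"""

epoch_pattern = re.compile(
    r"Epoch (?P<epoch>\d+)/(?P<total>\d+) - (?P<seconds>[\d.]+)s"
    r" - loss: (?P<loss>[\d.]+) - acc: (?P<acc>[\d.]+) - batches: (?P<batches>\d+)"
)

# One dict per epoch line found in the log.
epochs = [
    {
        "epoch": int(m["epoch"]),
        "seconds": float(m["seconds"]),
        "loss": float(m["loss"]),
        "acc": float(m["acc"]),
        "batches": int(m["batches"]),
    }
    for m in epoch_pattern.finditer(log_text)
]

for e in epochs:
    print(e)

# Sanity check: 120000 training examples / batch size 8 = 15000 batches per epoch.
expected_batches = 120000 // 8
```

The batch count in the log matches `training_examples / batch_size`, a quick way to confirm that the whole training set was used in each epoch.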


In [0]:
from sparknlp.base import LightPipeline

light_model = LightPipeline(use_pipelineModel)

text='''
Fearing the fate of Italy, the centre-right government has threatened to be merciless with those who flout tough restrictions. 
As of Wednesday it will also include all shops being closed across Greece, with the exception of supermarkets. Banks, pharmacies, pet-stores, mobile phone stores, opticians, bakers, mini-markets, couriers and food delivery outlets are among the few that will also be allowed to remain open.
'''
result = light_model.annotate(text)

result['class']

Out[74]: ['Business']

In [0]:
light_model.annotate('the soccer games will be postponed.')


Out[75]: {'document': ['the soccer games will be postponed.'],
 'sentence_embeddings': ['the soccer games will be postponed.'],
 'class': ['Sports']}

End of Notebook