![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

# 1. Quickstart Tutorial on Spark NLP - 1 hr

This is the 1-hour workshop version of the full set of training notebooks: https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/tutorials/Certification_Trainings/Public

An intro article for Spark NLP:

https://towardsdatascience.com/introduction-to-spark-nlp-foundations-and-basic-components-part-i-c83b7629ed59

How to get started with Spark NLP in 2 weeks:

https://towardsdatascience.com/how-to-get-started-with-sparknlp-in-2-weeks-cb47b2ba994d

https://towardsdatascience.com/how-to-wrap-your-head-around-spark-nlp-a6f6a968b7e8

Articles on NER and text classification in Spark NLP:

https://towardsdatascience.com/named-entity-recognition-ner-with-bert-in-spark-nlp-874df20d1d77

https://medium.com/spark-nlp/named-entity-recognition-for-healthcare-with-sparknlp-nerdl-and-nercrf-a7751b6ad571

https://towardsdatascience.com/text-classification-in-spark-nlp-with-bert-and-universal-sentence-encoders-e644d618ca32

A webinar showing how to train a NER model from scratch (90 min):

https://www.youtube.com/watch?v=djWX0MR2Ooo

Workshop repo where you can start playing with Spark NLP in Colab
(you will also find a Databricks notebook under each folder):

https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/tutorials/Certification_Trainings

## Coding ...

In [0]:
import sparknlp

from sparknlp.base import *
from sparknlp.annotator import *

from pyspark.ml import Pipeline

print("Spark NLP version", sparknlp.version())

spark

## Using Pretrained Pipelines

For a more detailed notebook, see https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/1.SparkNLP_Basics.ipynb

In [0]:
from sparknlp.pretrained import PretrainedPipeline

pipeline_dl = PretrainedPipeline('explain_document_dl', lang='en')


**Stages**
- DocumentAssembler
- SentenceDetector
- Tokenizer
- NER (NER with GloVe 100D embeddings, CoNLL2003 dataset)
- Lemmatizer
- Stemmer
- Part of Speech
- SpellChecker (Norvig)
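
One way to read the stage list above: each stage takes what the earlier stages produced and adds one more output of its own. Below is a toy sketch of that idea in plain Python. It is not the Spark NLP API; the function names and the naive splitting rules are made up for illustration only.

```python
# Toy illustration only: plain Python, NOT the Spark NLP API.
# Each "stage" reads keys written by earlier stages and adds one new key.

def assemble(row):
    # Hypothetical stand-in for a document assembler: wrap the raw text.
    row["document"] = row["text"].strip()
    return row

def detect_sentences(row):
    # Naive split on "."; real sentence detection is far more careful.
    row["sentence"] = [s.strip() for s in row["document"].split(".") if s.strip()]
    return row

def tokenize(row):
    # Naive whitespace tokenization over every sentence.
    row["token"] = [tok for s in row["sentence"] for tok in s.split()]
    return row

def run_pipeline(stages, row):
    # Apply the stages in order; the order matters because of the dependencies.
    for stage in stages:
        row = stage(row)
    return row

result = run_pipeline(
    [assemble, detect_sentences, tokenize],
    {"text": "John and Peter are brothers. They live in Europe."},
)
print(sorted(result.keys()))
print(result["token"])
```

Swapping `tokenize` in front of `detect_sentences` fails with a `KeyError` in this toy, which is the same kind of dependency the input/output column names express in a real pipeline.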

In [0]:
testDoc = '''
Peter Parker is a very good persn.
My life in Russia is very intersting.
John and Peter are brthers. However they don't support each other that much.
Mercedes Benz is also working on a driverless car.
Europe is very culture rich. There are huge churches! and big houses!
'''

result = pipeline_dl.annotate(testDoc)


In [0]:
result.keys()

In [0]:
result['entities']

In [0]:
import pandas as pd

df = pd.DataFrame({'token':result['token'], 'ner_label':result['ner'],
                      'spell_corrected':result['checked'], 'POS':result['pos'],
                      'lemmas':result['lemma'], 'stems':result['stem']})

df

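| | token | ner_label | spell_corrected | POS | lemmas | stems |
|---|---|---|---|---|---|---|
| 0 | Peter | B-PER | Peter | NNP | Peter | peter |
| 1 | Parker | I-PER | Parker | NNP | Parker | parker |
| 2 | is | O | is | VBZ | be | i |
| 3 | a | O | a | DT | a | a |
| 4 | very | O | very | RB | very | veri |
| 5 | good | O | good | JJ | good | good |
| 6 | persn | O | person | NN | person | person |
| 7 | . | O | . | . | . | . |
| 8 | My | O | My | PRP$ | My | my |
| 9 | life | O | life | NN | life | life |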


### Using fullAnnotate to get more details

In [0]:
detailed_result = pipeline_dl.fullAnnotate(testDoc)

detailed_result[0]['entities']

In [0]:
chunks = []
entities = []
for n in detailed_result[0]['entities']:
    chunks.append(n.result)
    entities.append(n.metadata['entity'])

df = pd.DataFrame({'chunks': chunks, 'entities': entities})
df

| | chunks | entities |
|---|---|---|
| 0 | Peter Parker | PER |
| 1 | Russia | LOC |
| 2 | John | PER |
| 3 | Peter | PER |
| 4 | Mercedes Benz | ORG |
| 5 | Europe | LOC |


In [0]:
tuples = []

for x,y,z in zip(detailed_result[0]["token"], detailed_result[0]["pos"], detailed_result[0]["ner"]):

  tuples.append((int(x.metadata['sentence']), x.result, x.begin, x.end, y.result, z.result))

df = pd.DataFrame(tuples, columns=['sent_id','token','start','end','pos', 'ner'])

df


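| | sent_id | token | start | end | pos | ner |
|---|---|---|---|---|---|---|
| 0 | 0 | Peter | 1 | 5 | NNP | B-PER |
| 1 | 0 | Parker | 7 | 12 | NNP | I-PER |
| 2 | 0 | is | 14 | 15 | VBZ | O |
| 3 | 0 | a | 17 | 17 | DT | O |
| 4 | 0 | very | 19 | 22 | RB | O |
| 5 | 0 | good | 24 | 27 | JJ | O |
| 6 | 0 | persn | 29 | 33 | NN | O |
| 7 | 0 | . | 34 | 34 | . | O |
| 8 | 1 | My | 36 | 37 | PRP$ | O |
| 9 | 1 | life | 39 | 42 | NN | O |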
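
The start/end values in the table are character offsets into the input string, and the end index looks inclusive (`Peter` spans 1 to 5 because `testDoc` begins with a newline at position 0). The sketch below reproduces those numbers with a small regex tokenizer in plain Python. The regex is an assumption for illustration, not the rule set the Spark NLP Tokenizer actually uses.

```python
import re

def token_offsets(text):
    """Return (token, begin, end) triples where `end` is inclusive."""
    # Words, or single punctuation marks, as separate tokens (toy rule).
    pattern = re.compile(r"\w+|[^\w\s]")
    return [(m.group(), m.start(), m.end() - 1) for m in pattern.finditer(text)]

# Same opening as testDoc: a leading newline shifts every offset by one.
sample = "\nPeter Parker is a very good persn."
offsets = token_offsets(sample)

for token, begin, end in offsets:
    print(token, begin, end)

# With an inclusive end, slicing needs end + 1 to recover the token.
first_token, first_begin, first_end = offsets[0]
recovered = sample[first_begin:first_end + 1]
```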


### Sentiment Analysis

In [0]:
sentiment = PretrainedPipeline('analyze_sentiment', lang='en')

In [0]:
result = sentiment.annotate("The movie I watched today was not a good one")

result['sentiment']

In [0]:
sentiment_imdb_glove = PretrainedPipeline('analyze_sentimentdl_glove_imdb', lang='en')

In [0]:
comment = '''
It's a very scary film but what impressed me was how true the film sticks to the original's tricks; it isn't filled with loud in-your-face jump scares, in fact, a lot of what makes this film scary is the slick cinematography and intricate shadow play. The use of lighting and creation of atmosphere is what makes this film so tense, which is why it's perfectly suited for those who like Horror movies but without the obnoxious gore.
'''
result = sentiment_imdb_glove.annotate(comment)

result['sentiment']

## Using the modules in a pipeline for custom tasks

For a more detailed notebook, see https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/2.Text_Preprocessing_with_SparkNLP_Annotators_Transformers.ipynb

In [0]:
!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/jupyter/annotation/english/spark-nlp-basics/sample-sentences-en.txt
  
dbutils.fs.cp("file:/databricks/driver/sample-sentences-en.txt", "dbfs:/") 

In [0]:
with open('sample-sentences-en.txt') as f:
  print (f.read())

In [0]:
spark_df = spark.read.text('/sample-sentences-en.txt').toDF('text')

spark_df.show(truncate=False)

In [0]:
textFiles = spark.sparkContext.wholeTextFiles("/sample-sentences-en.txt",4) # a glob such as /*.txt also works
    
spark_df_folder = textFiles.toDF(schema=['path','text'])

spark_df_folder.show(truncate=30)

In [0]:
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = SentenceDetector()\
    .setInputCols(['document'])\
    .setOutputCol('sentences')

tokenizer = Tokenizer() \
    .setInputCols(["sentences"]) \
    .setOutputCol("token")

nlpPipeline = Pipeline(stages=[
     documentAssembler, 
     sentenceDetector,
     tokenizer
 ])

empty_df = spark.createDataFrame([['']]).toDF("text")

pipelineModel = nlpPipeline.fit(empty_df)

In [0]:
result = pipelineModel.transform(spark_df)


In [0]:
result.show(truncate=20)


In [0]:
result.printSchema()


In [0]:
result.select('sentences.result').take(3)


### StopWords Cleaner

In [0]:
stopwords_cleaner = StopWordsCleaner()\
      .setInputCols("token")\
      .setOutputCol("cleanTokens")\
      .setCaseSensitive(False)

In [0]:
stopwords_cleaner.getStopWords()[:10]

In [0]:
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

nlpPipeline = Pipeline(stages=[
     documentAssembler, 
     tokenizer,
     stopwords_cleaner
 ])

empty_df = spark.createDataFrame([['']]).toDF("text")

pipelineModel = nlpPipeline.fit(empty_df)

result = pipelineModel.transform(spark_df)

result.show()

In [0]:
result.select('cleanTokens.result').take(1)
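
To make the effect of `setCaseSensitive(False)` concrete, here is a toy stop word filter in plain Python. The five-word list and the function name are hypothetical; the real annotator ships with a much longer default list and its own implementation.

```python
# Toy illustration only: a tiny, hypothetical stop word list.
STOP_WORDS = {"the", "is", "a", "of", "and"}

def remove_stop_words(tokens, stop_words=STOP_WORDS, case_sensitive=False):
    """Drop tokens that appear in stop_words, optionally ignoring case."""
    if case_sensitive:
        return [t for t in tokens if t not in stop_words]
    lowered = {w.lower() for w in stop_words}
    return [t for t in tokens if t.lower() not in lowered]

tokens = ["The", "movie", "is", "a", "story", "of", "hope"]

clean_insensitive = remove_stop_words(tokens)
clean_sensitive = remove_stop_words(tokens, case_sensitive=True)

print(clean_insensitive)
print(clean_sensitive)
```

In the case-sensitive run the capitalized `The` survives, because only the lowercase form is in the list.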


### Text Matcher

In [0]:
! wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Public/data/news_category_train.csv
  
dbutils.fs.cp("file:/databricks/driver/news_category_train.csv", "dbfs:/")

In [0]:
news_df = spark.read \
      .option("header", True) \
      .csv("/news_category_train.csv")


news_df.show(5, truncate=50)

In [0]:
entities = ['Wall Street', 'USD', 'stock', 'NYSE']
with open ('financial_entities.txt', 'w') as f:
    for i in entities:
        f.write(i+'\n')


entities = ['soccer', 'world cup', 'Messi', 'FC Barcelona']
with open ('sport_entities.txt', 'w') as f:
    for i in entities:
        f.write(i+'\n')


In [0]:
dbutils.fs.cp("file:/databricks/driver/financial_entities.txt", "dbfs:/")
dbutils.fs.cp("file:/databricks/driver/sport_entities.txt", "dbfs:/")

In [0]:
documentAssembler = DocumentAssembler()\
    .setInputCol("description")\
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

financial_entity_extractor = TextMatcher() \
    .setInputCols(["document",'token'])\
    .setOutputCol("financial_entities")\
    .setEntities("file:/databricks/driver/financial_entities.txt")\
    .setCaseSensitive(False)\
    .setEntityValue('financial_entity')

sport_entity_extractor = TextMatcher() \
    .setInputCols(["document",'token'])\
    .setOutputCol("sport_entities")\
    .setEntities("file:/databricks/driver/sport_entities.txt")\
    .setCaseSensitive(False)\
    .setEntityValue('sport_entity')


nlpPipeline = Pipeline(stages=[
     documentAssembler, 
     tokenizer,
     financial_entity_extractor,
     sport_entity_extractor
 ])

empty_df = spark.createDataFrame([['']]).toDF("description")

pipelineModel = nlpPipeline.fit(empty_df)

In [0]:
result = pipelineModel.transform(news_df)


In [0]:
result.select('financial_entities.result','sport_entities.result').take(2)


If both result lists come back empty, there are no financial or sport entities in the first two rows.
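
For intuition about what a phrase matcher does, and why some rows yield empty lists, here is a toy case-insensitive matcher over a token list in plain Python. It is a sketch under simplifying assumptions (whitespace tokens, exact phrase match) and not how TextMatcher is implemented.

```python
def match_phrases(tokens, phrases, case_sensitive=False):
    """Return every phrase occurrence found in the token sequence."""
    norm = (lambda s: s) if case_sensitive else str.lower
    targets = [[norm(w) for w in p.split()] for p in phrases]
    normalized = [norm(t) for t in tokens]
    matches = []
    for i in range(len(normalized)):
        for target in targets:
            # Compare a window of the same length as the phrase.
            if target and normalized[i:i + len(target)] == target:
                matches.append(" ".join(tokens[i:i + len(target)]))
    return matches

# Same entity list that the notebook writes to financial_entities.txt.
financial = ["Wall Street", "USD", "stock", "NYSE"]

hits = match_phrases(
    "Shares fell on wall street as the STOCK market opened".split(), financial
)
no_hits = match_phrases("The weather was sunny".split(), financial)

print(hits)
print(no_hits)
```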

In [0]:
from pyspark.sql import functions as F
result.select('description','financial_entities.result','sport_entities.result')\
.toDF('text','financial_matches','sport_matches').filter((F.size('financial_matches')>1) | (F.size('sport_matches')>1))\
.show(truncate=70)


### Using the pipeline in a LightPipeline

In [0]:
light_model = LightPipeline(pipelineModel)

light_result = light_model.fullAnnotate("Google, Inc. significantly cut the expected share price for its stock at Wall Street")

light_result[0]['financial_entities']

## Pretrained Models

Spark NLP offers pre-trained models in around **40 languages**. To use one, you load it by name, which downloads it to your disk, and then configure its parameters for your use case and dataset. You do not need to train a new model from scratch: the pre-trained SOTA algorithms can be applied directly to your own data with transform().

The official documentation describes in detail how these models were trained, including the algorithms and datasets used.

https://github.com/JohnSnowLabs/spark-nlp-models

For a more detailed notebook, see https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/3.SparkNLP_Pretrained_Models.ipynb

### LemmatizerModel and ContextSpellCheckerModel

In [0]:
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

spellModel = ContextSpellCheckerModel\
    .pretrained('spellcheck_dl')\
    .setInputCols("token")\
    .setOutputCol("checked")

lemmatizer = LemmatizerModel.pretrained('lemma_antbnc', 'en') \
    .setInputCols(["checked"]) \
    .setOutputCol("lemma")

pipeline = Pipeline(stages = [
    documentAssembler,
    tokenizer,
    spellModel,
    lemmatizer
  ])

empty_ds = spark.createDataFrame([[""]]).toDF("text")

sc_model = pipeline.fit(empty_ds)

lp = LightPipeline(sc_model)

In [0]:
result = lp.annotate("Plaese alliow me tao introdduce myhelf, I am a man of waelth und tiaste and he just knows that")

list(zip(result['token'],result['checked'],result['lemma']))

### Word and Sentence Embeddings

#### Word Embeddings

In [0]:
glove_embeddings = WordEmbeddingsModel.pretrained('glove_100d')\
      .setInputCols(["document", "token"])\
      .setOutputCol("embeddings")

In [0]:
documentAssembler = DocumentAssembler()\
    .setInputCol("description")\
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

nlpPipeline = Pipeline(stages=[
     documentAssembler, 
     tokenizer,
     glove_embeddings
 ])

empty_df = spark.createDataFrame([['']]).toDF("description")
pipelineModel = nlpPipeline.fit(empty_df)

result = pipelineModel.transform(news_df.limit(1))

output = result.select('token.result','embeddings.embeddings').limit(1).rdd.flatMap(lambda x: x).collect()


In [0]:
pd.DataFrame({'token':output[0],'embeddings':output[1]})

| | token | embeddings |
|---|---|---|
| 0 | Short | [-0.4308899939060211, -0.023907000198960304, -... |
| 1 | sellers | [0.1458200067281723, 0.2753300070762634, -0.20... |
| 2 | , | [-0.10767000168561935, 0.11052999645471573, 0.... |
| 3 | Wall | [0.21383999288082123, 0.22098000347614288, 0.0... |
| 4 | Street's | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... |
| 5 | dwindling | [-0.5611299872398376, 1.1217999458312988, 0.65... |
| 6 | band | [-0.12160000205039978, -0.24347999691963196, 0... |
| 7 | of | [-0.15289999544620514, -0.24278999865055084, 0... |
| 8 | ultra | [-0.3504500091075897, -0.27733999490737915, 0.... |
| 9 | cynics | [-0.06557200103998184, 0.45271000266075134, 0.... |
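
In the output above, `Street's` gets an all-zero vector, which is what you would expect for a token that is missing from the embedding vocabulary. The toy lookup below sketches that behavior in plain Python. The 3-dimensional vectors, the lowercasing step and the helper names are assumptions for illustration, not values or code from `glove_100d`.

```python
# Toy illustration only: hypothetical 3-dimensional vectors.
# The glove_100d model used above has 100 dimensions per token.
DIM = 3
VECTORS = {
    "wall": [0.2, 0.2, 0.0],
    "street": [0.1, 0.3, 0.1],
}

def lookup(token):
    """Return the vector for a token, or zeros if it is out of vocabulary."""
    # Lowercasing here is an assumption of this sketch.
    return VECTORS.get(token.lower(), [0.0] * DIM)

def is_oov(vector):
    """Treat an all-zero vector as a sign of an out-of-vocabulary token."""
    return all(v == 0.0 for v in vector)

tokens = ["Wall", "Street's", "street"]
oov_flags = {t: is_oov(lookup(t)) for t in tokens}
print(oov_flags)
```

A check like `is_oov` can be handy for spotting how many tokens in a dataset fall outside the vocabulary before training on the embeddings.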


In [0]:
result = pipelineModel.transform(news_df.limit(10))

result_df = result.select(F.explode(F.arrays_zip(result.token.result, result.embeddings.embeddings)).alias("cols")) \
                  .select(F.expr("cols['0']").alias("token"),
                          F.expr("cols['1']").alias("embeddings"))

result_df.show(10, truncate=100)

#### Bert Embeddings

In [0]:
bert_embeddings = BertEmbeddings.pretrained('bert_base_cased')\
      .setInputCols(["document", "token"])\
      .setOutputCol("embeddings")

In [0]:
documentAssembler = DocumentAssembler()\
    .setInputCol("description")\
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

 
nlpPipeline = Pipeline(stages=[
     documentAssembler, 
     tokenizer,
     bert_embeddings
 ])

empty_df = spark.createDataFrame([['']]).toDF("description")

pipelineModel = nlpPipeline.fit(empty_df)

result = pipelineModel.transform(news_df.limit(10))

result_df = result.select(F.explode(F.arrays_zip(result.token.result, result.embeddings.embeddings)).alias("cols")) \
                  .select(F.expr("cols['0']").alias("token"),
                          F.expr("cols['1']").alias("bert_embeddings"))

result_df.show(truncate=100)

#### Bert Sentence Embeddings

In [0]:
bert_sentence_embeddings = BertSentenceEmbeddings.pretrained('sent_small_bert_L6_128')\
    .setInputCols(["document"])\
    .setOutputCol("bert_sent_embeddings")


nlpPipeline = Pipeline(stages=[
     documentAssembler, 
     bert_sentence_embeddings
 ])


empty_df = spark.createDataFrame([['']]).toDF("description")

pipelineModel = nlpPipeline.fit(empty_df)

result = pipelineModel.transform(news_df.limit(10))

result_df = result.select(F.explode(F.arrays_zip(result.document.result, result.bert_sent_embeddings.embeddings)).alias("cols"))\
                  .select(F.expr("cols['0']").alias("document"),
                          F.expr("cols['1']").alias("bert_sent_embeddings"))

result_df.show(truncate=100)

#### Universal Sentence Encoder

In [0]:
# no need for token columns 
use_embeddings = UniversalSentenceEncoder.pretrained('tfhub_use')\
    .setInputCols(["document"])\
    .setOutputCol("sentence_embeddings")

In [0]:
from pyspark.sql import functions as F

documentAssembler = DocumentAssembler()\
    .setInputCol("description")\
    .setOutputCol("document")

nlpPipeline = Pipeline(stages=[
     documentAssembler, 
     use_embeddings
   ])

empty_df = spark.createDataFrame([['']]).toDF("description")

pipelineModel = nlpPipeline.fit(empty_df)

result = pipelineModel.transform(news_df.limit(10))

result_df = result.select(F.explode(F.arrays_zip(result.document.result, result.sentence_embeddings.embeddings)).alias("cols"))\
                  .select(F.expr("cols['0']").alias("document"),
                          F.expr("cols['1']").alias("USE_embeddings"))

result_df.show(truncate=100)

### Named Entity Recognition (NER) Models

For a detailed notebook, see https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/4.NERDL_Training.ipynb

In [0]:
documentAssembler = DocumentAssembler()\
    .setInputCol("description")\
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

glove_embeddings = WordEmbeddingsModel.pretrained('glove_100d')\
      .setInputCols(["document", "token"])\
      .setOutputCol("embeddings")

onto_ner = NerDLModel.pretrained("onto_100", 'en') \
      .setInputCols(["document", "token", "embeddings"]) \
      .setOutputCol("ner")

ner_converter = NerConverter() \
      .setInputCols(["document", "token", "ner"]) \
      .setOutputCol("ner_chunk")


nlpPipeline = Pipeline(stages=[
       documentAssembler, 
       tokenizer,
       glove_embeddings,
       onto_ner,
       ner_converter
 ])

empty_df = spark.createDataFrame([['']]).toDF("description")

pipelineModel = nlpPipeline.fit(empty_df)

In [0]:
result = pipelineModel.transform(news_df.limit(10))

result.select(F.explode(F.arrays_zip(result.ner_chunk.result, result.ner_chunk.metadata)).alias("cols")) \
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']['entity']").alias("ner_label")).show(truncate=False)

In [0]:
light_model = LightPipeline(pipelineModel)

light_result = light_model.fullAnnotate('Peter Parker is a nice persn and lives in New York. Bruce Wayne is also a nice guy and lives in Gotham City.')


chunks = []
entities = []

for n in light_result[0]['ner_chunk']:
        
    chunks.append(n.result)
    entities.append(n.metadata['entity']) 
    
    
import pandas as pd

df = pd.DataFrame({'chunks':chunks, 'entities':entities})

df

| | chunks | entities |
|---|---|---|
| 0 | Peter Parker | PERSON |
| 1 | New York | GPE |
| 2 | Bruce Wayne | PERSON |
| 3 | Gotham City | GPE |
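
The NER model emits one IOB tag per token (`B-` opens an entity, `I-` continues it, `O` is outside), and the converter stage turns those tags into the chunks shown above. The function below is a minimal sketch of that merging step in plain Python. It is a simplified illustration, not the NerConverter implementation, and the tag sequence is written by hand rather than produced by a model.

```python
def iob_to_chunks(tokens, tags):
    """Merge per-token IOB tags into (chunk_text, label) pairs."""
    chunks = []
    current_tokens, current_label = [], None

    def flush():
        # Close the entity being built, if any.
        if current_tokens:
            chunks.append((" ".join(current_tokens), current_label))

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-") or (tag.startswith("I-") and tag[2:] != current_label):
            # A new entity starts (an I- tag with a new label is treated as a start).
            flush()
            current_tokens, current_label = [token], tag[2:]
        elif tag.startswith("I-"):
            current_tokens.append(token)
        else:
            flush()
            current_tokens, current_label = [], None
    flush()
    return chunks

tokens = ["Peter", "Parker", "lives", "in", "New", "York", "."]
tags = ["B-PERSON", "I-PERSON", "O", "O", "B-GPE", "I-GPE", "O"]  # hand-written example

chunks = iob_to_chunks(tokens, tags)
print(chunks)
```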


#### Train a NER model

**To train a new NER from scratch, check out**

https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/4.NERDL_Training.ipynb

In [0]:
!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/src/test/resources/conll2003/eng.train

#dbutils.fs.cp("file:/databricks/driver/sample-sentences-en.txt", "dbfs:/")

In [0]:
from sparknlp.training import CoNLL

training_data = CoNLL().readDataset(spark, 'file:/databricks/driver/eng.train')

training_data.show(3)

In [0]:
training_data.select(F.explode(F.arrays_zip(training_data.token.result, training_data.label.result)).alias("cols"))\
              .select(F.expr("cols['0']").alias("token"),
                      F.expr("cols['1']").alias("ground_truth"))\
              .groupBy('ground_truth').count().orderBy('count', ascending=False).show(100,truncate=False)

In [0]:
# You can use any word embeddings you want (Glove, Elmo, Bert, custom etc.)

glove_embeddings = WordEmbeddingsModel.pretrained('glove_100d')\
          .setInputCols(["document", "token"])\
          .setOutputCol("embeddings")

In [0]:
%fs mkdirs dbfs:/ner_logs

In [0]:
nerTagger = NerDLApproach()\
  .setInputCols(["sentence", "token", "embeddings"])\
  .setLabelColumn("label")\
  .setOutputCol("ner")\
  .setMaxEpochs(2)\
  .setLr(0.003)\
  .setPo(0.05)\
  .setBatchSize(32)\
  .setRandomSeed(0)\
  .setVerbose(1)\
  .setValidationSplit(0.2)\
  .setEvaluationLogExtended(True) \
  .setEnableOutputLogs(True)\
  .setIncludeConfidence(True)\
  .setOutputLogsPath('dbfs:/ner_logs') # if not set, logs will be written to ~/annotator_logs
 #.setGraphFolder('graphs') >> put your graph file (pb) under this folder if you are using a custom graph generated thru NerDL-Graph
    
    
ner_pipeline = Pipeline(stages=[
          glove_embeddings,
          nerTagger
 ])

In [0]:
# remove the existing logs

!rm -r /dbfs/ner_logs/*

In [0]:
ner_model = ner_pipeline.fit(training_data)

# 1 epoch takes around 2.5 min with batch size=32
# if you get an error for incompatible TF graph, use NERDL Graph script to generate the necessary TF graph at the end of this notebook

In [0]:
#%sh cd ~/annotator_logs && ls -lt

In [0]:
%sh cd /dbfs/ner_logs && pwd && ls -l

In [0]:
%sh head -n 45 /dbfs/ner_logs/NerDLApproach_*

In [0]:
%fs mkdirs dbfs:/models

In [0]:
ner_model.stages[1].write().overwrite().save('dbfs:/models/NER_glove_e1_b32')

In [0]:
%sh cd /dbfs/models/ && pwd && ls -l

#### Load saved model

In [0]:
document = DocumentAssembler()\
        .setInputCol("text")\
        .setOutputCol("document")

sentence = SentenceDetector()\
        .setInputCols(['document'])\
        .setOutputCol('sentence')

token = Tokenizer()\
        .setInputCols(['sentence'])\
        .setOutputCol('token')

glove_embeddings = WordEmbeddingsModel.pretrained('glove_100d')\
        .setInputCols(["document", "token"])\
        .setOutputCol("embeddings")
  
# load back and use in any pipeline
loaded_ner_model = NerDLModel.load("dbfs:/models/NER_glove_e1_b32")\
        .setInputCols(["sentence", "token", "embeddings"])\
        .setOutputCol("ner")

converter = NerConverter()\
        .setInputCols(["document", "token", "ner"])\
        .setOutputCol("ner_span")

ner_prediction_pipeline = Pipeline(stages = [
        document,
        sentence,
        token,
        glove_embeddings,
        loaded_ner_model,
        converter
])

In [0]:
empty_data = spark.createDataFrame([['']]).toDF("text")

prediction_model = ner_prediction_pipeline.fit(empty_data)


In [0]:
text = "Peter Parker is a nice guy and lives in New York."

sample_data = spark.createDataFrame([[text]]).toDF("text")

sample_data.show()

In [0]:
preds = prediction_model.transform(sample_data)

preds.select(F.explode(F.arrays_zip(preds.ner_span.result,preds.ner_span.metadata)).alias("entities")) \
      .select(F.expr("entities['0']").alias("chunk"),
              F.expr("entities['1'].entity").alias("entity")).show(truncate=False)

### Text Classification

For a detailed notebook, see https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/5.Text_Classification_with_ClassifierDL.ipynb

In [0]:
! wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Public/data/news_category_test.csv
  
dbutils.fs.cp("file:/databricks/driver/news_category_test.csv", "dbfs:/") 

In [0]:
from pyspark.sql.functions import col

trainDataset = spark.read \
      .option("header", True) \
      .csv("/news_category_train.csv")

trainDataset.groupBy("category") \
    .count() \
    .orderBy(col("count").desc()) \
    .show()

In [0]:
testDataset = spark.read \
      .option("header", True) \
      .csv("/news_category_test.csv")


testDataset.groupBy("category") \
    .count() \
    .orderBy(col("count").desc()) \
    .show()

In [0]:
%fs mkdirs dbfs:/clf_dl_logs

In [0]:
# actual content is inside description column
document = DocumentAssembler()\
    .setInputCol("description")\
    .setOutputCol("document")
    
# we can also use a sentence detector here if we want to train on and get predictions for each sentence

use_embeddings = UniversalSentenceEncoder.pretrained('tfhub_use')\
    .setInputCols(["document"])\
    .setOutputCol("sentence_embeddings")

# the classes/labels/categories are in category column
classsifierdl = ClassifierDLApproach()\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("class")\
    .setLabelColumn("category")\
    .setMaxEpochs(5)\
    .setBatchSize(8)\
    .setLr(0.001)\
    .setEnableOutputLogs(True)\
    .setOutputLogsPath('dbfs:/clf_dl_logs') 

use_clf_pipeline = Pipeline(
    stages = [
        document,
        use_embeddings,
        classsifierdl
    ])

In [0]:
# remove the existing logs

! rm -r /dbfs/clf_dl_logs/*

In [0]:
use_pipelineModel = use_clf_pipeline.fit(trainDataset)
# 5 epochs takes around 3 min

In [0]:
%sh cd /dbfs/clf_dl_logs/ && ls -lt

In [0]:
%sh cat  /dbfs/clf_dl_logs/ClassifierDLApproach*


In [0]:
from sparknlp.base import LightPipeline

light_model = LightPipeline(use_pipelineModel)

text='''
Fearing the fate of Italy, the centre-right government has threatened to be merciless with those who flout tough restrictions. 
As of Wednesday it will also include all shops being closed across Greece, with the exception of supermarkets. Banks, pharmacies, pet-stores, mobile phone stores, opticians, bakers, mini-markets, couriers and food delivery outlets are among the few that will also be allowed to remain open.
'''
result = light_model.annotate(text)

result['class']

In [0]:
light_model.annotate('the soccer games will be postponed.')


## NerDL Graph

In [0]:
!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/jupyter/training/english/dl-ner/nerdl-graph/create_graph.py
!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/jupyter/training/english/dl-ner/nerdl-graph/dataset_encoder.py
!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/jupyter/training/english/dl-ner/nerdl-graph/ner_model.py
!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/jupyter/training/english/dl-ner/nerdl-graph/ner_model_saver.py
!wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/jupyter/training/english/dl-ner/nerdl-graph/sentence_grouper.py

import sys

# make the downloaded helper scripts importable (sys.path entries must be directories)
sys.path.append('/databricks/driver/')
import create_graph

ntags = 12           # number of labels
embeddings_dim = 90  # embedding dimension the graph is built for
nchars = 60          # number of distinct characters
create_graph.create_graph(ntags, embeddings_dim, nchars)

End of Notebook