![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

# Spark NLP
### Multi-class Sentiment Classification
#### By using SentimentDL

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/training/english/classification/SentimentDL_Train_and_Evaluate.ipynb)

Only run this block if you are inside Google Colab; otherwise, skip it.

In [None]:
# This is only to set up PySpark and Spark NLP on Colab
!wget http://setup.johnsnowlabs.com/colab.sh -O - | bash

In this notebook we want to follow the training logs as they are produced, so we start the session with `real_time_output=True`.

In [5]:
import sparknlp

spark = sparknlp.start(real_time_output=True)

print("Spark NLP version: ", sparknlp.version())
print("Apache Spark version", spark.version)


Spark NLP version:  4.1.0
Apache Spark version 3.2.1


Let's download the IMDB movie reviews dataset for training our multi-class sentiment classifier.

In [6]:
!wget -O aclimdb_train.csv https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/sentiment-corpus/aclimdb/aclimdb_train.csv

--2022-09-23 19:33:40--  https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/sentiment-corpus/aclimdb/aclimdb_train.csv
Resolving s3.amazonaws.com (s3.amazonaws.com)... 54.231.230.64
Connecting to s3.amazonaws.com (s3.amazonaws.com)|54.231.230.64|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 33497180 (32M) [text/csv]
Saving to: ‘aclimdb_train.csv’


2022-09-23 19:33:42 (20.5 MB/s) - ‘aclimdb_train.csv’ saved [33497180/33497180]



In [7]:
!wget -O aclimdb_test.csv https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/sentiment-corpus/aclimdb/aclimdb_test.csv

--2022-09-23 19:33:42--  https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/sentiment-corpus/aclimdb/aclimdb_test.csv
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.81.83
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.81.83|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 32715164 (31M) [text/csv]
Saving to: ‘aclimdb_test.csv’


2022-09-23 19:33:44 (21.3 MB/s) - ‘aclimdb_test.csv’ saved [32715164/32715164]



In [8]:
!head -n 5 aclimdb_train.csv

text,label
"This is an Excellent little movie! The acting is good and the music is fantastic!! Play it on a 5-1 sound system and enjoy! It will never win any awards but its good clean fun for all!! I recommend this movie to all fans of pretty girls funny and hansom men as well as robot lovers everyone!!1 P.S. It also stars Lisa Rinna! Enjoy!!This is a very hard movie to find, It is out of print. I first saw it on Showtime many years ago but recently found a used VHS copy. Its still a must see for all!!!This is an Excellent little movie! The acting is good and the music is fantastic!! Play it on a 5-1 sound system and enjoy! It will never win any awards but its good clean fun for all!! I recommend this movie to all fans of pretty girls funny and hansom men as well as robot lovers everyone!!1 P.S. It also stars Lisa Rinna! Enjoy!! Dave Engle This is a very hard movie to find, It is out of print. I first saw it on Showtime many years ago but recently found a used VHS copy. Its still a mus

The review text is in the `text` column and the sentiment is in the `label` column.

In [9]:
trainDataset = spark.read \
      .option("header", True) \
      .csv("aclimdb_train.csv")

In [10]:
trainDataset.show()

+--------------------+--------+
|                text|   label|
+--------------------+--------+
|This is an Excell...|positive|
|The Sarah Silverm...|positive|
|"Prom Night" is a...|negative|
|So often a band w...|positive|
|"Pet Sematary" is...|positive|
|I watched the fil...|negative|
|Boy this movie ha...|negative|
|Checking the spoi...|negative|
|Despite its rathe...|positive|
|Absolute masterpi...|positive|
|The tweedy profes...|positive|
|A movie best summ...|negative|
|Take young, prett...|negative|
|For months I've b...|negative|
|"Batman: The Myst...|positive|
|Well, it was funn...|negative|
|I have seen the s...|positive|
|Brainless film ab...|negative|
|Leave it to geniu...|negative|
|Seven Pounds star...|positive|
+--------------------+--------+
only showing top 20 rows



In [11]:
trainDataset.count()

25000

# Evaluation 

Let's evaluate our SentimentDL model during training, then save it and load it into a new pipeline. For the evaluation we use a test dataset that the model has never seen. To do this, we first need to prepare the test dataset as a Parquet file, as shown below:

In [12]:
imdb_test_dataset = spark.read \
      .option("header", True) \
      .csv("aclimdb_test.csv")

In [13]:
from pyspark.ml import Pipeline

from sparknlp.annotator import *
from sparknlp.common import *
from sparknlp.base import *

In [14]:
document = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

use = UniversalSentenceEncoder.pretrained() \
 .setInputCols(["document"])\
 .setOutputCol("sentence_embeddings")

pipeline = Pipeline(stages = [document,use])

test_dataset = pipeline.fit(imdb_test_dataset).transform(imdb_test_dataset)

tfhub_use download started this may take some time.
Approximate size to download 923.7 MB
[ \ ]Download done! Loading the resource.
[OK!]


In [15]:
test_dataset.show(2)

+--------------------+--------+--------------------+--------------------+
|                text|   label|            document| sentence_embeddings|
+--------------------+--------+--------------------+--------------------+
|The Second Woman ...|negative|[{document, 0, 11...|[{sentence_embedd...|
|In my opinion the...|positive|[{document, 0, 14...|[{sentence_embedd...|
+--------------------+--------+--------------------+--------------------+
only showing top 2 rows



Now that our test dataset has the required embeddings, we save it as Parquet and use it while training our SentimentDL model.

In [16]:
test_dataset.write.parquet("./aclimdb_test.parquet")

Now let's train the model, using both a validation split and the test dataset above for evaluation.

In [17]:
# the classes/labels/categories are in the label column
sentimentdl = SentimentDLApproach()\
  .setInputCols(["sentence_embeddings"])\
  .setOutputCol("class")\
  .setLabelColumn("label")\
  .setMaxEpochs(5)\
  .setEnableOutputLogs(True)\
  .setValidationSplit(0.2) \
  .setEvaluationLogExtended(True) \
  .setTestDataset("./aclimdb_test.parquet")

pipeline = Pipeline(
    stages = [
        document,
        use,
        sentimentdl
    ])
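
As a rough sanity check on the settings above, the split sizes and the number of batches per epoch can be worked out by hand. The sketch below assumes the 25,000 training rows counted earlier and a batch size of 64, which is the value reported in the training log rather than one set explicitly in this notebook. The helper name `split_sizes` is hypothetical.

```python
import math

def split_sizes(total_rows, validation_split):
    """Return (training_rows, validation_rows) for a fractional validation split."""
    validation_rows = round(total_rows * validation_split)
    return total_rows - validation_rows, validation_rows

# 25,000 rows in aclimdb_train.csv, setValidationSplit(0.2)
train_rows, val_rows = split_sizes(25000, 0.2)

# Assumption: batch size of 64, as reported in the training log
batches_per_epoch = math.ceil(train_rows / 64)

print(train_rows, val_rows, batches_per_epoch)  # prints: 20000 5000 313
```

These figures agree with the `training_examples: 20000`, `validation examples = 5000`, and `batches: 313` entries in the training log.
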

In [18]:
pipelineModel = pipeline.fit(trainDataset)

Training started - epochs: 5 - learning_rate: 0.005 - batch_size: 64 - training_examples: 20000
Epoch 1/5 - 4.03s - loss: 152.93646 - acc: 0.8290765 - batches: 313
Quality on validation dataset (20.0%), validation examples = 5000
time to finish evaluation: 0.34s
label        tp	 fp	 fn	 prec	 rec	 f1
1            2198	 477	 280	 0.8216822	 0.8870056	 0.8530953
0            2045	 280	 477	 0.8795699	 0.8108644	 0.8438209
tp: 4243 fp: 757 fn: 757 labels: 2
Macro-average	 prec: 0.85062605, rec: 0.848935, f1: 0.84977967
Micro-average	 prec: 0.8486, recall: 0.8486, f1: 0.8486
Quality on test dataset: 
time to finish evaluation: 1.18s
label        tp	 fp	 fn	 prec	 rec	 f1
1            10943	 2350	 1557	 0.82321525	 0.87544	 0.84852475
0            10150	 1557	 2350	 0.86700267	 0.812	 0.8386004
tp: 21093 fp: 3907 fn: 3907 labels: 2
Macro-average	 prec: 0.845109, rec: 0.84371996, f1: 0.84441394
Micro-average	 prec: 0.84372, recall: 0.84372, f1: 0.84372
Epoch 2/5 - 4.63s - loss: 144.18388 - a
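
To make the extended evaluation log easier to read, the per-label and averaged scores can be recomputed from the `tp`, `fp`, and `fn` counts. This is a minimal sketch using the validation counts from epoch 1 above; the helper name `prf` is hypothetical. Treating the macro F1 as the harmonic mean of macro precision and macro recall is an inference from the logged numbers, not something confirmed against the Spark NLP source.

```python
def prf(tp, fp, fn):
    """Precision, recall, and F1 from true-positive, false-positive, and false-negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Validation counts from the first epoch of the log: label -> (tp, fp, fn)
counts = {"1": (2198, 477, 280), "0": (2045, 280, 477)}

per_label = {label: prf(*c) for label, c in counts.items()}

# Macro average: mean of the per-label precision and recall values.
# Assumption: macro F1 is the harmonic mean of those two means.
macro_prec = sum(p for p, _, _ in per_label.values()) / len(per_label)
macro_rec = sum(r for _, r, _ in per_label.values()) / len(per_label)
macro_f1 = 2 * macro_prec * macro_rec / (macro_prec + macro_rec)

# Micro average: pool the counts over all labels first, then score once.
tp, fp, fn = (sum(c[i] for c in counts.values()) for i in range(3))
micro_prec, micro_rec, micro_f1 = prf(tp, fp, fn)

print(round(macro_prec, 4), round(macro_rec, 4), round(macro_f1, 4))
print(round(micro_prec, 4), round(micro_rec, 4), round(micro_f1, 4))
```

With every example assigned exactly one of two labels, each false positive for one label is a false negative for the other, which is why micro precision, recall, and F1 all coincide with plain accuracy (4243 / 5000).
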

# How to use an already trained SentimentDL pipeline or its model

We have two ways of using what we already trained: pipeline or model.

Let's see how we can save the entire pipeline, load it, and run predictions with that pre-trained pipeline.

## Save and load pre-trained SentimentDL pipeline

In [19]:
# The free tier of Google Colab comes with limited memory,
# so it is not possible to save and load the full pipeline in this notebook.
# You can do this locally or on a machine with more memory.

# from pyspark.ml import PipelineModel
# pipelineModel.save("./sentimentdl_pipeline")
# loadedPipeline = PipelineModel.load("./sentimentdl_pipeline")
# loadedPipeline.transform(YOUR_DATAFRAME)

## Save and load pre-trained SentimentDL model

In [20]:
# Use an hdfs:/ path if you are saving to a Hadoop distributed file system
pipelineModel.stages[-1].write().overwrite().save('./tmp_sentimentdl_model')



Let's use our pre-trained SentimentDLModel in a pipeline: 

In [21]:

# In a new pipeline you can load it for prediction
document = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

use = UniversalSentenceEncoder.pretrained() \
 .setInputCols(["document"])\
 .setOutputCol("sentence_embeddings")

sentimentdl = SentimentDLModel.load("./tmp_sentimentdl_model") \
  .setInputCols(["sentence_embeddings"])\
  .setOutputCol("class")

pipeline = Pipeline(
    stages = [
        document,
        use,
        sentimentdl
    ])


tfhub_use download started this may take some time.
Approximate size to download 923.7 MB
[OK!]


Now let's run predictions through the whole pipeline, starting with a small DataFrame of example reviews:

In [22]:
from pyspark.sql.types import StringType

dfTest = spark.createDataFrame([
    "This movie is a delight for those of all ages. I have seen it several times and each time I am enchanted by the characters and magic. The cast is outstanding, the special effects delightful, everything most believable.",
    "This film was to put it simply rubbish. The child actors couldn't act, as can be seen by Harry's supposed surprise on learning he's a wizard. I'm a wizard! is said with such indifference you'd think he's not surprised at all."
], StringType()).toDF("text")

In [23]:
prediction = pipeline.fit(dfTest).transform(dfTest)

In [24]:
prediction.select("class.result").show()

prediction.select("class.metadata").show(truncate=False)

+----------+
|    result|
+----------+
|[positive]|
|[negative]|
+----------+

+------------------------------------------------------------------+
|metadata                                                          |
+------------------------------------------------------------------+
|[{sentence -> 0, positive -> 1.0, negative -> 2.8792261E-8}]      |
|[{sentence -> 0, positive -> 1.8572706E-5, negative -> 0.9999814}]|
+------------------------------------------------------------------+
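
The `metadata` column holds one score per class, so the predicted label and its confidence can be recovered from it directly. The sketch below works on plain dictionaries copied from the output above, and assumes the metadata values arrive as strings and that `sentence` is an index rather than a class score. The helper `top_label` and its `threshold` parameter are hypothetical and not part of Spark NLP.

```python
def top_label(metadata, threshold=0.0):
    """Return (label, score) for the highest-scoring class, or (None, score) if below threshold."""
    # Assumption: values are strings; "sentence" is an index, not a class score.
    scores = {k: float(v) for k, v in metadata.items() if k != "sentence"}
    label = max(scores, key=scores.get)
    score = scores[label]
    return (label, score) if score >= threshold else (None, score)

# Metadata copied from the two predictions above
rows = [
    {"sentence": "0", "positive": "1.0", "negative": "2.8792261E-8"},
    {"sentence": "0", "positive": "1.8572706E-5", "negative": "0.9999814"},
]

for row in rows:
    print(top_label(row, threshold=0.6))
```

Returning `None` below the threshold is one way to flag low-confidence predictions for manual review instead of forcing a label.
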

