![JohnSnowLabs](https://sparknlp.org/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/training/english/classification/SentimentDL_Train_and_Evaluate.ipynb)

# Multi-class Sentiment Classification using SentimentDL

Only run this block if you are inside Google Colab; otherwise, skip it.

In [None]:
# Only run this cell when you are using Spark NLP on Google Colab
!wget http://setup.johnsnowlabs.com/colab.sh -O - | bash

In this notebook we check the training logs on the fly, so we start the session with `real_time_output=True`.

In [None]:
import sparknlp

spark = sparknlp.start(real_time_output=True)

print("Spark NLP version: ", sparknlp.version())
print("Apache Spark version", spark.version)


23/02/20 18:01:18 WARN Utils: Your hostname, duc-manjaro resolves to a loopback address: 127.0.1.1; using 192.168.0.34 instead (on interface enp3s0)
23/02/20 18:01:18 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Spark NLP version:  4.3.1
Apache Spark version 3.3.0
:: loading settings :: url = jar:file:/home/root/.conda/envs/sparknlp/lib/python3.8/site-packages/pyspark/jars/ivy-2.5.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
23/02/20 18:01:20 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


Let's download the IMDB movie reviews dataset to train our multi-class sentiment classifier.

In [None]:
!wget -O aclimdb_train.csv https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/sentiment-corpus/aclimdb/aclimdb_train.csv

--2023-02-20 18:01:23--  https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/sentiment-corpus/aclimdb/aclimdb_train.csv
Loaded CA certificate '/etc/ssl/certs/ca-certificates.crt'
Resolving s3.amazonaws.com (s3.amazonaws.com)... 54.231.141.0, 52.216.18.187, 52.217.165.168, ...
Connecting to s3.amazonaws.com (s3.amazonaws.com)|54.231.141.0|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 33497180 (32M) [text/csv]
Saving to: ‘aclimdb_train.csv’


2023-02-20 18:01:27 (11,0 MB/s) - ‘aclimdb_train.csv’ saved [33497180/33497180]



In [None]:
!wget -O aclimdb_test.csv https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/sentiment-corpus/aclimdb/aclimdb_test.csv

--2023-02-20 18:01:27--  https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/sentiment-corpus/aclimdb/aclimdb_test.csv
Loaded CA certificate '/etc/ssl/certs/ca-certificates.crt'
Resolving s3.amazonaws.com (s3.amazonaws.com)... 54.231.172.80, 52.216.233.189, 52.216.171.29, ...
Connecting to s3.amazonaws.com (s3.amazonaws.com)|54.231.172.80|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 32715164 (31M) [text/csv]
Saving to: ‘aclimdb_test.csv’


2023-02-20 18:01:30 (15,8 MB/s) - ‘aclimdb_test.csv’ saved [32715164/32715164]



In [None]:
!head -n 5 aclimdb_train.csv

text,label
"This is an Excellent little movie! The acting is good and the music is fantastic!! Play it on a 5-1 sound system and enjoy! It will never win any awards but its good clean fun for all!! I recommend this movie to all fans of pretty girls funny and hansom men as well as robot lovers everyone!!1 P.S. It also stars Lisa Rinna! Enjoy!!This is a very hard movie to find, It is out of print. I first saw it on Showtime many years ago but recently found a used VHS copy. Its still a must see for all!!!This is an Excellent little movie! The acting is good and the music is fantastic!! Play it on a 5-1 sound system and enjoy! It will never win any awards but its good clean fun for all!! I recommend this movie to all fans of pretty girls funny and hansom men as well as robot lovers everyone!!1 P.S. It also stars Lisa Rinna! Enjoy!! Dave Engle This is a very hard movie to find, It is out of print. I first saw it on Showtime many years ago but recently found a used VHS copy. Its still a mus

The review content is in the `text` column and the sentiment is in the `label` column.

In [None]:
trainDataset = spark.read \
      .option("header", True) \
      .csv("aclimdb_train.csv")

In [None]:
trainDataset.show()

+--------------------+--------+
|                text|   label|
+--------------------+--------+
|This is an Excell...|positive|
|The Sarah Silverm...|positive|
|"Prom Night" is a...|negative|
|So often a band w...|positive|
|"Pet Sematary" is...|positive|
|I watched the fil...|negative|
|Boy this movie ha...|negative|
|Checking the spoi...|negative|
|Despite its rathe...|positive|
|Absolute masterpi...|positive|
|The tweedy profes...|positive|
|A movie best summ...|negative|
|Take young, prett...|negative|
|For months I've b...|negative|
|"Batman: The Myst...|positive|
|Well, it was funn...|negative|
|I have seen the s...|positive|
|Brainless film ab...|negative|
|Leave it to geniu...|negative|
|Seven Pounds star...|positive|
+--------------------+--------+
only showing top 20 rows



In [None]:
trainDataset.count()

25000
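Before training, it can help to confirm how the rows are distributed across labels. The following is a minimal sketch that uses Python's standard `csv` module rather than Spark, assuming the `text,label` header shown above. `count_labels` is a hypothetical helper, and the sample rows are made up for illustration.

```python
import csv
import io
from collections import Counter


def count_labels(csv_file):
    """Count rows per label in a CSV stream that has a `text,label` header."""
    reader = csv.DictReader(csv_file)
    return Counter(row["label"] for row in reader)


# Tiny in-memory sample in the same format as aclimdb_train.csv (made-up rows).
# Quoted fields may contain commas, as the real reviews do.
sample = io.StringIO(
    'text,label\n'
    '"Great fun, from start to finish!",positive\n'
    '"Dull, slow and far too long.",negative\n'
    '"A must see for all.",positive\n'
)

label_counts = count_labels(sample)
print(label_counts)
```

For the real file, the same helper could be called on `open("aclimdb_train.csv", newline="", encoding="utf-8")`. On the Spark DataFrame itself, `trainDataset.groupBy("label").count().show()` gives the equivalent breakdown.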

# Evaluation 

Let's evaluate our SentimentDL model during training, then save it and load it into a new pipeline, using a test dataset that the model has never seen. To do this, we first need to prepare the test dataset as a Parquet file, as shown below:

In [None]:
imdb_test_dataset = spark.read \
      .option("header", True) \
      .csv("aclimdb_test.csv")

In [None]:
from pyspark.ml import Pipeline

from sparknlp.annotator import *
from sparknlp.common import *
from sparknlp.base import *

In [None]:
document = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

use = UniversalSentenceEncoder.pretrained() \
 .setInputCols(["document"])\
 .setOutputCol("sentence_embeddings")

pipeline = Pipeline(stages = [document,use])

test_dataset = pipeline.fit(imdb_test_dataset).transform(imdb_test_dataset)

tfhub_use download started this may take some time.
Approximate size to download 923.7 MB
Download done! Loading the resource.
[OK!]


In [None]:
test_dataset.show(2)

+--------------------+--------+--------------------+--------------------+
|                text|   label|            document| sentence_embeddings|
+--------------------+--------+--------------------+--------------------+
|The Second Woman ...|negative|[{document, 0, 11...|[{sentence_embedd...|
|In my opinion the...|positive|[{document, 0, 14...|[{sentence_embedd...|
+--------------------+--------+--------------------+--------------------+
only showing top 2 rows



Now that our test dataset has the required embeddings, we save it as Parquet and use it while training our SentimentDL model.

In [None]:
test_dataset.write.parquet("./aclimdb_test.parquet")

Now let's train the model, using a validation split and the test dataset above for evaluation.

In [None]:
# the classes/labels/categories are in the `label` column
sentimentdl = SentimentDLApproach()\
  .setInputCols(["sentence_embeddings"])\
  .setOutputCol("class")\
  .setLabelColumn("label")\
  .setMaxEpochs(5)\
  .setEnableOutputLogs(True)\
  .setValidationSplit(0.2) \
  .setEvaluationLogExtended(True) \
  .setTestDataset("./aclimdb_test.parquet")

pipeline = Pipeline(
    stages = [
        document,
        use,
        sentimentdl
    ])
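As a rough check on what `setValidationSplit(0.2)` implies for this dataset, the arithmetic can be sketched in plain Python. This assumes the 25,000 training rows counted earlier and the batch size of 64 that the training log reports; how Spark NLP performs the split internally is not shown here.

```python
import math

total_rows = 25000        # trainDataset.count() from above
validation_split = 0.2    # value passed to setValidationSplit
batch_size = 64           # assumed: the batch size reported in the training log

validation_rows = round(total_rows * validation_split)
training_rows = total_rows - validation_rows
batches_per_epoch = math.ceil(training_rows / batch_size)

print(validation_rows, training_rows, batches_per_epoch)  # 5000 20000 313
```

These figures agree with the `training_examples: 20000`, `validation examples = 5000`, and `batches: 313` values that appear in the training output.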

In [None]:
pipelineModel = pipeline.fit(trainDataset)

Training started - epochs: 5 - learning_rate: 0.005 - batch_size: 64 - training_examples: 20000
Epoch 1/5 - 1.42s - loss: 147.28938 - acc: 0.82757413 - batches: 313
Quality on validation dataset (20.0%), validation examples = 5000
time to finish evaluation: 0.06s
label        tp	 fp	 fn	 prec	 rec	 f1
1            1968	 263	 519	 0.88211566	 0.79131484	 0.8342518
0            2250	 519	 263	 0.8125677	 0.8953442	 0.85195
tp: 4218 fp: 782 fn: 782 labels: 2
Macro-average	 prec: 0.84734166, rec: 0.84332955, f1: 0.8453309
Micro-average	 prec: 0.8436, recall: 0.8436, f1: 0.8436
Quality on test dataset: 
time to finish evaluation: 0.24s
label        tp	 fp	 fn	 prec	 rec	 f1
1            9922	 1415	 2578	 0.87518746	 0.79376	 0.83248734
0            11085	 2578	 1415	 0.81131524	 0.8868	 0.8473799
tp: 21007 fp: 3993 fn: 3993 labels: 2
Macro-average	 prec: 0.84325135, rec: 0.84028, f1: 0.8417631
Micro-average	 prec: 0.84028, recall: 0.84028, f1: 0.84028
Epoch 2/5 - 1.38s - loss: 134.46562 - a
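The per-label and averaged figures in the extended evaluation log can be reproduced from the `tp`, `fp`, and `fn` counts. The sketch below uses the epoch 1 validation counts copied from the log above; `precision_recall_f1` is a hypothetical helper, not part of Spark NLP. Judging by these numbers, the macro-average F1 in the log appears to be the harmonic mean of the macro precision and macro recall, not the mean of the per-label F1 scores.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# (tp, fp, fn) per label, copied from the epoch 1 validation log.
counts = {"1": (1968, 263, 519), "0": (2250, 519, 263)}

per_label = {label: precision_recall_f1(*c) for label, c in counts.items()}

# Macro average: mean of the per-label precision and recall,
# then F1 as the harmonic mean of those two averages.
macro_precision = sum(m[0] for m in per_label.values()) / len(per_label)
macro_recall = sum(m[1] for m in per_label.values()) / len(per_label)
macro_f1 = 2 * macro_precision * macro_recall / (macro_precision + macro_recall)

# Micro average: pool the counts across labels first.
total_tp = sum(c[0] for c in counts.values())
total_fp = sum(c[1] for c in counts.values())
total_fn = sum(c[2] for c in counts.values())
micro_precision, micro_recall, micro_f1 = precision_recall_f1(total_tp, total_fp, total_fn)

for label, (p, r, f1) in per_label.items():
    print(f"label {label}: prec={p:.4f} rec={r:.4f} f1={f1:.4f}")
print(f"macro: prec={macro_precision:.4f} rec={macro_recall:.4f} f1={macro_f1:.4f}")
print(f"micro: prec={micro_precision:.4f} rec={micro_recall:.4f} f1={micro_f1:.4f}")
```

With two labels and one prediction per row, every false positive for one label is a false negative for the other, which is why micro precision, recall, and F1 all come out to the same value.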

# How to use an already trained SentimentDL pipeline or its model

We have two ways of using what we already trained: pipeline or model.

Let's see how we can save the entire pipeline, load it, and make predictions with that pre-trained pipeline.

## Save and load pre-trained SentimentDL pipeline

In [None]:
# Google Colab is free, so it comes with limited memory.
# It's not possible to save and load the pipeline in this notebook,
# but you can do this locally or on a machine with more memory.

# from pyspark.ml import PipelineModel
# pipelineModel.save("./sentimentdl_pipeline")
# loadedPipeline = PipelineModel.load("./sentimentdl_pipeline")
# loadedPipeline.transform(YOUR_DATAFRAME)

## Save and load pre-trained SentimentDL model

In [None]:
# use an hdfs:/ path if you are saving to a distributed file system in Hadoop
pipelineModel.stages[-1].write().overwrite().save('./tmp_sentimentdl_model')

Let's use our pre-trained SentimentDLModel in a pipeline: 

In [None]:

# In a new pipeline you can load it for prediction
document = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

use = UniversalSentenceEncoder.pretrained() \
 .setInputCols(["document"])\
 .setOutputCol("sentence_embeddings")

sentimentdl = SentimentDLModel.load("./tmp_sentimentdl_model") \
  .setInputCols(["sentence_embeddings"])\
  .setOutputCol("class")

pipeline = Pipeline(
    stages = [
        document,
        use,
        sentimentdl
    ])


tfhub_use download started this may take some time.
Approximate size to download 923.7 MB
[OK!]


Now let's create some sample reviews so we can run predictions through the whole pipeline:

In [None]:
from pyspark.sql.types import StringType

dfTest = spark.createDataFrame([
    "This movie is a delight for those of all ages. I have seen it several times and each time I am enchanted by the characters and magic. The cast is outstanding, the special effects delightful, everything most believable.",
    "This film was to put it simply rubbish. The child actors couldn't act, as can be seen by Harry's supposed surprise on learning he's a wizard. I'm a wizard! is said with such indifference you'd think he's not surprised at all."
], StringType()).toDF("text")

In [None]:
prediction = pipeline.fit(dfTest).transform(dfTest)

In [None]:
prediction.select("class.result").show()

prediction.select("class.metadata").show(truncate=False)

+----------+
|    result|
+----------+
|[positive]|
|[negative]|
+----------+

+-------------------------------------------------------------------+
|metadata                                                           |
+-------------------------------------------------------------------+
|[{sentence -> 0, positive -> 1.0, negative -> 2.8575936E-8}]       |
|[{sentence -> 0, positive -> 1.2174318E-5, negative -> 0.99998784}]|
+-------------------------------------------------------------------+
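Each `class.metadata` entry holds one score per label alongside the sentence index. The sketch below picks the highest-scoring label from such a map, using plain dictionaries copied from the output above instead of a Spark DataFrame. `top_label` is a hypothetical helper. It assumes the scores arrive as strings, since annotation metadata in Spark NLP is a string-to-string map.

```python
def top_label(metadata):
    """Return (label, score) for the highest-scoring class in one metadata map."""
    # Skip the sentence index and convert the remaining string values to floats.
    scores = {k: float(v) for k, v in metadata.items() if k != "sentence"}
    label = max(scores, key=scores.get)
    return label, scores[label]


# Metadata values copied from the prediction output above.
rows = [
    {"sentence": "0", "positive": "1.0", "negative": "2.8575936E-8"},
    {"sentence": "0", "positive": "1.2174318E-5", "negative": "0.99998784"},
]

predictions = [top_label(m) for m in rows]
print(predictions)  # [('positive', 1.0), ('negative', 0.99998784)]
```

The labels recovered this way match the `class.result` column, and the score gives a rough sense of how confident the model is in each prediction.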

