![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

# Spark NLP
### Multi-class Sentiment Classification
#### By using SentimentDL

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/training/english/classification/SentimentDL_train_multiclass_sentiment_classifier.ipynb)

Only run this block if you are inside Google Colab; otherwise, skip it.

In [1]:
import os

# Install java
! apt-get install -y openjdk-8-jdk-headless -qq > /dev/null
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["PATH"] = os.environ["JAVA_HOME"] + "/bin:" + os.environ["PATH"]
! java -version

# Install pyspark
! pip install --ignore-installed -q pyspark==2.4.4

# Install Spark NLP
! pip install --ignore-installed -q spark-nlp==2.5.1

openjdk version "1.8.0_252"
OpenJDK Runtime Environment (build 1.8.0_252-8u252-b09-1~18.04-b09)
OpenJDK 64-Bit Server VM (build 25.252-b09, mixed mode)

In [5]:
import sparknlp

spark = sparknlp.start()

print("Spark NLP version: ", sparknlp.version())
print("Apache Spark version", spark.version)


Spark NLP version:  2.5.1
Apache Spark version 2.4.4


Let's download the IMDB movie reviews dataset to train our multi-class sentiment classifier.

In [6]:
!wget -O aclimdb_train.csv https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/sentiment-corpus/aclimdb/aclimdb_train.csv

--2020-06-01 15:21:56--  https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/sentiment-corpus/aclimdb/aclimdb_train.csv
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.236.101
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.236.101|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 33497180 (32M) [text/csv]
Saving to: ‘aclimdb_train.csv’


2020-06-01 15:21:57 (112 MB/s) - ‘aclimdb_train.csv’ saved [33497180/33497180]



In [7]:
!wget -O aclimdb_test.csv https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/sentiment-corpus/aclimdb/aclimdb_test.csv

--2020-06-01 15:22:18--  https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/sentiment-corpus/aclimdb/aclimdb_test.csv
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.86.29
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.86.29|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 32715164 (31M) [text/csv]
Saving to: ‘aclimdb_test.csv’


2020-06-01 15:22:19 (105 MB/s) - ‘aclimdb_test.csv’ saved [32715164/32715164]



In [8]:
!head aclimdb_train.csv

text,label
"This is an Excellent little movie! The acting is good and the music is fantastic!! Play it on a 5-1 sound system and enjoy! It will never win any awards but its good clean fun for all!! I recommend this movie to all fans of pretty girls funny and hansom men as well as robot lovers everyone!!1 P.S. It also stars Lisa Rinna! Enjoy!!This is a very hard movie to find, It is out of print. I first saw it on Showtime many years ago but recently found a used VHS copy. Its still a must see for all!!!This is an Excellent little movie! The acting is good and the music is fantastic!! Play it on a 5-1 sound system and enjoy! It will never win any awards but its good clean fun for all!! I recommend this movie to all fans of pretty girls funny and hansom men as well as robot lovers everyone!!1 P.S. It also stars Lisa Rinna! Enjoy!! Dave Engle This is a very hard movie to find, It is out of print. I first saw it on Showtime many years ago but recently found a used VHS copy. Its still a mus

The content is in the `text` column and the sentiment is in the `label` column.
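
The column layout described above can be checked on a tiny sample before the full file is loaded into Spark. The following is a minimal sketch that uses only the Python standard library; the two sample rows are hypothetical and only mimic the `text,label` layout of `aclimdb_train.csv`.

```python
import csv
import io
from collections import Counter

# Hypothetical two-row sample in the same layout as aclimdb_train.csv:
# a header line, a quoted `text` field, and a `label` field.
sample_csv = io.StringIO(
    'text,label\n'
    '"A fun, well-acted movie!",positive\n'
    '"Dull plot, and far too long.",negative\n'
)

rows = list(csv.DictReader(sample_csv))

# Count how many reviews carry each label.
label_counts = Counter(row["label"] for row in rows)

print(rows[0]["text"])      # review text; the comma inside the quotes is preserved
print(dict(label_counts))   # number of rows per label
```

Because the reviews are quoted, commas inside a review stay part of the `text` field instead of splitting the row.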

In [None]:
trainDataset = spark.read \
      .option("header", True) \
      .csv("aclimdb_train.csv")

In [10]:
trainDataset.show()

+--------------------+--------+
|                text|   label|
+--------------------+--------+
|This is an Excell...|positive|
|The Sarah Silverm...|positive|
|"Prom Night" is a...|negative|
|So often a band w...|positive|
|"Pet Sematary" is...|positive|
|I watched the fil...|negative|
|Boy this movie ha...|negative|
|Checking the spoi...|negative|
|Despite its rathe...|positive|
|Absolute masterpi...|positive|
|The tweedy profes...|positive|
|A movie best summ...|negative|
|Take young, prett...|negative|
|For months I've b...|negative|
|"Batman: The Myst...|positive|
|Well, it was funn...|negative|
|I have seen the s...|positive|
|Brainless film ab...|negative|
|Leave it to geniu...|negative|
|Seven Pounds star...|positive|
+--------------------+--------+
only showing top 20 rows



In [11]:
trainDataset.count()

25000

In [None]:
from pyspark.ml import Pipeline

from sparknlp.annotator import *
from sparknlp.common import *
from sparknlp.base import *

In [17]:
# the review text is in the `text` column
document = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

use = UniversalSentenceEncoder.pretrained() \
 .setInputCols(["document"])\
 .setOutputCol("sentence_embeddings")

# the classes/labels/categories are in the `label` column
sentimentdl = SentimentDLApproach()\
  .setInputCols(["sentence_embeddings"])\
  .setOutputCol("class")\
  .setLabelColumn("label")\
  .setMaxEpochs(5)\
  .setEnableOutputLogs(True)

pipeline = Pipeline(
    stages = [
        document,
        use,
        sentimentdl
    ])

tfhub_use download started this may take some time.
Approximate size to download 923.7 MB
[OK!]


In [None]:
pipelineModel = pipeline.fit(trainDataset)

In [21]:
!cd ~/annotator_logs && ls -l

total 4
-rw-r--r-- 1 root root 523 Jun  1 15:30 SentimentDLApproach_2ea7dc3149c2.log


In [22]:
!cat ~/annotator_logs/SentimentDLApproach_2ea7dc3149c2.log

Training started - total epochs: 5 - learning rate: 0.005 - batch size: 64 - training examples: 25000
Epoch 0/5 - 7.261388839%.2fs - loss: 184.75143 - accuracy: 0.82907856 - batches: 391
Epoch 1/5 - 7.019650974%.2fs - loss: 174.16364 - accuracy: 0.85578525 - batches: 391
Epoch 2/5 - 6.98391997%.2fs - loss: 171.41266 - accuracy: 0.8602084 - batches: 391
Epoch 3/5 - 7.030380175%.2fs - loss: 170.09117 - accuracy: 0.86528045 - batches: 391
Epoch 4/5 - 7.01538049%.2fs - loss: 168.41052 - accuracy: 0.8704247 - batches: 391
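
To track loss and accuracy across epochs, the log lines can be parsed with a regular expression. This is a rough sketch: the pattern is inferred only from the log lines shown above, so other Spark NLP versions may format the log differently. It also checks that the reported 391 batches matches 25000 examples split into batches of 64.

```python
import math
import re

# Lines copied from the training log shown above.
log_text = """\
Training started - total epochs: 5 - learning rate: 0.005 - batch size: 64 - training examples: 25000
Epoch 0/5 - 7.261388839%.2fs - loss: 184.75143 - accuracy: 0.82907856 - batches: 391
Epoch 4/5 - 7.01538049%.2fs - loss: 168.41052 - accuracy: 0.8704247 - batches: 391
"""

# Pattern inferred from the lines above (an assumption, not a documented format).
epoch_re = re.compile(
    r"Epoch (\d+)/(\d+) .* loss: ([\d.]+) - accuracy: ([\d.]+) - batches: (\d+)"
)

epochs = [
    {
        "epoch": int(m.group(1)),
        "loss": float(m.group(3)),
        "accuracy": float(m.group(4)),
        "batches": int(m.group(5)),
    }
    for m in epoch_re.finditer(log_text)
]

# 25000 examples in batches of 64 need ceil(25000 / 64) batches per epoch.
expected_batches = math.ceil(25000 / 64)

for e in epochs:
    print(e["epoch"], e["loss"], e["accuracy"], e["batches"] == expected_batches)
```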


# How to use an already trained SentimentDL pipeline or its model

We have two ways of using what we already trained: pipeline or model.

Let's see how we can save the entire pipeline, load it, and do some prediction with that pre-trained pipeline.

## Save and load pre-trained SentimentDL pipeline

In [None]:
# The free Google Colab runtime comes with limited memory, so it's not possible
# to save and load the pipeline in this notebook. You can do this locally or on a machine with more memory.

# from pyspark.ml import PipelineModel
# pipelineModel.save("./sentimentdl_pipeline")
# loadedPipeline = PipelineModel.load("./sentimentdl_pipeline")
# loadedPipeline.transform(YOUR_DATAFRAME)

## Save and load pre-trained SentimentDL model

In [None]:
# Use an hdfs:/ path if you are saving to a distributed file system in Hadoop
pipelineModel.stages[-1].write().overwrite().save('./tmp_sentimentdl_model')



Let's use our pre-trained SentimentDLModel in a pipeline: 

In [24]:

# In a new pipeline you can load it for prediction
document = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

use = UniversalSentenceEncoder.pretrained() \
 .setInputCols(["document"])\
 .setOutputCol("sentence_embeddings")

sentimentdl = SentimentDLModel.load("./tmp_sentimentdl_model") \
  .setInputCols(["sentence_embeddings"])\
  .setOutputCol("class")

pipeline = Pipeline(
    stages = [
        document,
        use,
        sentimentdl
    ])


tfhub_use download started this may take some time.
Approximate size to download 923.7 MB
[OK!]


Now let's create a small test DataFrame and run it through the whole pipeline to get predictions:

In [None]:
from pyspark.sql.types import StringType

dfTest = spark.createDataFrame([
    "This movie is a delight for those of all ages. I have seen it several times and each time I am enchanted by the characters and magic. The cast is outstanding, the special effects delightful, everything most believable.",
    "This film was to put it simply rubbish. The child actors couldn't act, as can be seen by Harry's supposed surprise on learning he's a wizard. I'm a wizard! is said with such indifference you'd think he's not surprised at all."
], StringType()).toDF("text")

In [None]:
prediction = pipeline.fit(dfTest).transform(dfTest)

In [30]:
prediction.select("class.result").show()

prediction.select("class.metadata").show(truncate=False)

+----------+
|    result|
+----------+
|[positive]|
|[negative]|
+----------+

+-----------------------------------------------------------------+
|metadata                                                         |
+-----------------------------------------------------------------+
|[[sentence -> 0, positive -> 1.0, negative -> 9.762569E-10]]     |
|[[sentence -> 0, positive -> 4.49094E-5, negative -> 0.99995506]]|
+-----------------------------------------------------------------+
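
The metadata above carries one score per label, so the winning label and its confidence can be read from it directly. A minimal sketch in plain Python follows; it assumes the metadata values arrive as strings once collected, and the helper name `top_label` is hypothetical.

```python
# Metadata for the two predictions above, written as plain dicts.
# Assumption: the values are strings, so they are converted with float().
metadata_rows = [
    {"sentence": "0", "positive": "1.0", "negative": "9.762569E-10"},
    {"sentence": "0", "positive": "4.49094E-5", "negative": "0.99995506"},
]


def top_label(metadata):
    """Return (label, score) for the highest-scoring label in one metadata map."""
    # The `sentence` key is an index, not a label score, so it is skipped.
    scores = {k: float(v) for k, v in metadata.items() if k != "sentence"}
    label = max(scores, key=scores.get)
    return label, scores[label]


for row in metadata_rows:
    print(top_label(row))
```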



# Evaluation 

Let's evaluate the SentimentDL model we trained earlier by using a test dataset that the model has never seen:

In [None]:
testDataset = spark.read \
      .option("header", True) \
      .csv("aclimdb_test.csv")

In [None]:
preds = pipelineModel.transform(testDataset)

In [79]:
preds.select('label','text',"class.result").show(50, truncate=50)

+--------+--------------------------------------------------+----------+
|   label|                                              text|    result|
+--------+--------------------------------------------------+----------+
|negative|The Second Woman is about the story of a myster...|[negative]|
|positive|In my opinion the directing, editing, lighting,...|[positive]|
|positive|I am listening to Istanbul, intent, my eyes clo...|[positive]|
|negative|Before I speak my piece, I would like to make a...|[positive]|
|positive|ManBearPig is a pretty funny episode of South P...|[positive]|
|negative|A buddy and I went to see this movie when it ca...|[negative]|
|negative|It is incredible that there were two films with...|[negative]|
|negative|Dire! Dismal! Awful! Laughable! Disappointing!<...|[negative]|
|positive|HLOTS was an outstanding series, its what NYPD ...|[positive]|
|negative|This is just one of those films which cannot ju...|[negative]|
|negative|This movie had the potential to be a very

In [None]:
preds_df = preds.select('label','text',"class.result").toPandas()

In [None]:
# The result is an array since in Spark NLP you can have multiple sentences.
# This means you can add SentenceDetector in the pipeline and feed it into
# UniversalSentenceEncoder and you can have prediction based on each sentence.
# Let's take the first item out of each array in the result column
preds_df['result'] = preds_df['result'].apply(lambda x : x[0])

`SentimentDL` accepts a threshold and assigns a separate label to any result whose confidence is below that threshold.

For instance, the default threshold is `0.6`, and everything below it is labeled `neutral`. You can change this label with `setThresholdLabel`.
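
The thresholding rule described above can be illustrated in plain Python. This is only a sketch of the described behavior, not Spark NLP's implementation; the function name `apply_threshold` and the second set of scores are hypothetical.

```python
def apply_threshold(scores, threshold=0.6, threshold_label="neutral"):
    """Return the top label, or threshold_label when its score is below threshold."""
    label = max(scores, key=scores.get)
    return label if scores[label] >= threshold else threshold_label


# Scores taken from the second prediction shown earlier: a confident result.
confident = {"positive": 4.49094e-5, "negative": 0.99995506}

# Hypothetical scores for an uncertain review.
uncertain = {"positive": 0.55, "negative": 0.45}

print(apply_threshold(confident))
print(apply_threshold(uncertain))
print(apply_threshold(uncertain, threshold_label="unsure"))
```

With the default threshold, only the uncertain example falls back to the threshold label.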

We need to filter out the `neutral` results since the original test dataset has no `neutral` labels to compare them with.

In [None]:
preds_df = preds_df[preds_df['result'] != 'neutral']


In [None]:
# We are going to use sklearn to evaluate the results on the test dataset
from sklearn.metrics import classification_report

Let's use `classification_report` from `sklearn` to evaluate the final scores. Keep in mind that, due to the limited resources of a free Google Colab, we only trained for 5 epochs.

In [98]:
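# Note: classification_report expects (y_true, y_pred). The call below passes the
# predictions first, so the support column counts predicted labels.
print(classification_report(preds_df['result'], preds_df['label']))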

              precision    recall  f1-score   support

    negative       0.87      0.86      0.86     12449
    positive       0.85      0.87      0.86     12041

    accuracy                           0.86     24490
   macro avg       0.86      0.86      0.86     24490
weighted avg       0.86      0.86      0.86     24490
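
To make the columns of the report concrete, the per-class metrics can be computed by hand. The sketch below uses plain Python and a hypothetical five-review toy dataset. Since `classification_report` takes its arguments as `(y_true, y_pred)`, the last line shows how swapping them changes the numbers.

```python
def per_class_metrics(y_true, y_pred, label):
    """Return (precision, recall, f1, support) for one label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)  # true positives
    fp = sum(t != label and p == label for t, p in pairs)  # false positives
    fn = sum(t == label and p != label for t, p in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    support = tp + fn  # number of rows whose true label is `label`
    return precision, recall, f1, support


# Hypothetical toy labels, not taken from the IMDB test set.
y_true = ["positive", "positive", "positive", "negative", "negative"]
y_pred = ["positive", "positive", "negative", "negative", "negative"]

print(per_class_metrics(y_true, y_pred, "positive"))
# Swapping the arguments swaps precision and recall and changes the support.
print(per_class_metrics(y_pred, y_true, "positive"))
```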

