![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/Spark_NLP_Udemy_MOOC/Open_Source/18.02.SentimentDL.ipynb)

# **SentimentDL**

This notebook covers the parameters and usage of `SentimentDL`, an annotator for multi-class sentiment analysis.


**📖 Learning Objectives:**

1. Understand how the `SentimentDL` annotator determines the sentiment or opinion expressed in unstructured text data.

2. Understand how `SentimentDL` follows a supervised approach, learning from labeled examples to classify subjective information in text, such as opinions, attitudes, and emotions.

3. Become comfortable using the different parameters of the annotator, which have a substantial influence on the performance of the model.


**🔗 Helpful Links:**

- Documentation : [SentimentDL](https://nlp.johnsnowlabs.com/docs/en/annotators#sentimentdl)

- Python Docs : [SentimentDLApproach](https://nlp.johnsnowlabs.com/api/python/reference/autosummary/sparknlp/annotator/classifier_dl/sentiment_dl/index.html#sparknlp.annotator.classifier_dl.sentiment_dl.SentimentDLApproach)

- Scala Docs : [SentimentDLApproach](https://nlp.johnsnowlabs.com/api/com/johnsnowlabs/nlp/annotators/classifier/dl/SentimentDLApproach.html)

- For extended examples of usage, see the [Spark NLP Workshop repository](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/5.Text_Classification_with_ClassifierDL.ipynb).


## **📜 Background**

`Sentiment Analysis` is a sub-field of NLP that measures the inclination of people’s opinions within unstructured text.

[Sentiment Analysis](https://en.wikipedia.org/wiki/Sentiment_analysis) is widely applied to voice of the customer materials such as reviews and survey responses, online and social media, and healthcare materials for applications that range from marketing to customer service to clinical medicine.


`SentimentDL` takes sentence embeddings as input, and this notebook uses the Universal Sentence Encoder (USE) to produce them. The USE model is trained and optimized for greater-than-word length text, such as sentences, phrases or short paragraphs. It is trained on a variety of data sources and a variety of tasks with the aim of dynamically accommodating a wide variety of natural language understanding tasks. The input is variable-length English text and the output is a 512-dimensional vector.

In this notebook, the aim is to analyze the sentiment of movie reviews using the deep learning-based `SentimentDL` annotator.



## **🎬 Colab Setup**

In [None]:
! pip install -q pyspark==3.1.2  spark-nlp==4.2.4

In [None]:
import sys
sys.path.append('../../')

import sparknlp
import pandas as pd
import pyspark.sql.functions as F
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.sql.functions import array_contains
from sparknlp.annotator import *
from sparknlp.common import RegexRule
from sparknlp.base import DocumentAssembler, Finisher

spark = sparknlp.start()

print("Spark NLP version", sparknlp.version())
print("Apache Spark version:", spark.version)

spark

Spark NLP version 4.2.4
Apache Spark version: 3.1.2


## **🖨️ Input/Output Annotation Types**

- Input: `SENTENCE EMBEDDINGS`

- Output: `CATEGORY`

# **💻 SentimentDLModel**

**SentimentDLModel** is an annotator for multi-class sentiment analysis.

## **🔎 Parameters**

- `setConfigProtoBytes`: Sets configProto from tensorflow, serialized into byte array.

- `setThreshold`: (float) Sets the minimum confidence score for a prediction to be accepted; predictions scoring below it receive the threshold label (`Default`: 0.6).

- `setThresholdLabel`: (str) Sets the label assigned when the score is below the threshold (`Default`: "neutral").



### **Using a 💎 Model From the John Snow Labs Models Hub**

Instead of training, saving, loading and getting predictions from a model, it is possible to use a model from the John Snow Labs Models Hub.

The model's name is [Sentiment Analysis of Tweets](https://nlp.johnsnowlabs.com/2021/01/18/sentimentdl_use_twitter_en.html). This model automatically classifies sentiment in tweets as negative or positive using Universal Sentence Encoder embeddings.


In [None]:
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")
    
use = UniversalSentenceEncoder.pretrained("tfhub_use", "en")\
 .setInputCols(["document"])\
 .setOutputCol("sentence_embeddings")


sentimentdl = SentimentDLModel.pretrained("sentimentdl_use_twitter", "en")\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("sentiment")

nlpPipeline = Pipeline(
      stages = [
          documentAssembler,
          use,
          sentimentdl
      ])

tfhub_use download started this may take some time.
Approximate size to download 923.7 MB
[OK!]
sentimentdl_use_twitter download started this may take some time.
Approximate size to download 11.4 MB
[OK!]


In [None]:
text_list = [
            """@Mbjthegreat i really dont want AT&amp;T phone service..they suck when it comes to having a signal""",
            """holy crap. I take a nap for 4 hours and Pitchfork blows up my twitter dashboard. I wish I was at Coachella.""",
            """@Susy412 he is working today  ive tried that still not working..... hmmmm!! im rubbish with computers haha!""",
            """Brand New Canon EOS 50D 15MP DSLR Camera Canon 17-85mm IS Lens ...: Web Technology Thread, Brand New Canon EOS 5.. http://u.mavrev.com/5a3t""",
            """Watching a programme about the life of Hitler, its only enhancing my geekiness of history.""",
            """GM says expects announcment on sale of Hummer soon - Reuters: WDSUGM says expects announcment on sale of Hummer .. http://bit.ly/4E1Fv""",
            """@accannis @edog1203 Great Stanford course. Thanks for making it available to the public! Really helpful and informative for starting off!""",
            """@the_real_usher LeBron is cool.  I like his personality...he has good character.""",
            """@sketchbug Lebron is a hometown hero to me, lol I love the Lakers but let's go Cavs, lol""",
            """@PDubyaD right!!! LOL we'll get there!! I have high expectations, Warren Buffet style.""",
            ]

In [None]:
empty_df = spark.createDataFrame([['']]).toDF("text")

pipelineModel = nlpPipeline.fit(empty_df)

df = spark.createDataFrame(pd.DataFrame({"text":text_list}))
result = pipelineModel.transform(df)

In [None]:
result.select(F.explode(F.arrays_zip(result.document.result, result.sentiment.result)).alias("cols")) \
.select(F.expr("cols['0']").alias("document"),
        F.expr("cols['1']").alias("sentiment")).show(truncate=150)

+-------------------------------------------------------------------------------------------------------------------------------------------+---------+
|                                                                                                                                   document|sentiment|
+-------------------------------------------------------------------------------------------------------------------------------------------+---------+
|                                         @Mbjthegreat i really dont want AT&amp;T phone service..they suck when it comes to having a signal| negative|
|                                holy crap. I take a nap for 4 hours and Pitchfork blows up my twitter dashboard. I wish I was at Coachella.| negative|
|                                @Susy412 he is working today  ive tried that still not working..... hmmmm!! im rubbish with computers haha!| negative|
|Brand New Canon EOS 50D 15MP DSLR Camera Canon 17-85mm IS Lens ...: Web Technology Thread, Brand New Canon EOS 5.. http://u.mavrev.com/5a3t|  neutral|
|                                                 Watching a programme about the life of Hitler, its only enhancing my geekiness of history.|  neutral|
|     GM says expects announcment on sale of Hummer soon - Reuters: WDSUGM says expects announcment on sale of Hummer .. http://bit.ly/4E1Fv| negative|
|  @accannis @edog1203 Great Stanford course. Thanks for making it available to the public! Really helpful and informative for starting off!| positive|
|                                                           @the_real_usher LeBron is cool.  I like his personality...he has good character.| positive|
|                                                   @sketchbug Lebron is a hometown hero to me, lol I love the Lakers but let's go Cavs, lol| positive|
|                                                     @PDubyaD right!!! LOL we'll get there!! I have high expectations, Warren Buffet style.| positive|
+-------------------------------------------------------------------------------------------------------------------------------------------+---------+
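
To summarize predictions like the ones above, a simple label tally is often enough. The snippet below is a minimal sketch in plain Python; the `predicted` list is copied by hand from the table above rather than collected from the Spark DataFrame.

```python
from collections import Counter

# Labels copied from the `sentiment` column of the table above, in row order.
predicted = [
    "negative", "negative", "negative", "neutral", "neutral",
    "negative", "positive", "positive", "positive", "positive",
]

label_counts = Counter(predicted)
for label, count in label_counts.most_common():
    print(f"{label}: {count}")
```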

### `setThreshold`


This parameter sets the minimum threshold for the final result. Here it is raised to 1.0 from the default value of 0.6.

The threshold is applied to the confidence score of the predicted class at inference time: when the score falls below the threshold, the annotator outputs the threshold label ("neutral" by default) instead of the predicted class. Raising the threshold therefore makes the model more conservative, labeling more borderline texts as neutral, while lowering it lets more low-confidence predictions through. It does not change the weights of the network.
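
The rule behind `setThreshold` and `setThresholdLabel` (a prediction whose score is below the threshold is replaced by the threshold label) can be sketched in plain Python. This is an illustrative simplification, not Spark NLP's internal code: the function name `apply_threshold` and the example scores are hypothetical.

```python
def apply_threshold(scores, threshold=0.6, threshold_label="neutral"):
    """Return the top-scoring label, or `threshold_label` when its score is below `threshold`.

    `scores` maps each class name to a confidence score (hypothetical values).
    """
    best_label = max(scores, key=scores.get)
    if scores[best_label] < threshold:
        return threshold_label
    return best_label


# Hypothetical confidence scores for two tweets.
confident = {"positive": 0.93, "negative": 0.07}
uncertain = {"positive": 0.55, "negative": 0.45}

print(apply_threshold(confident))                  # confident prediction is kept
print(apply_threshold(uncertain))                  # falls back to the threshold label
print(apply_threshold(confident, threshold=0.95))  # a stricter threshold rejects it too
```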

In [None]:
sentimentdl = SentimentDLModel.pretrained("sentimentdl_use_twitter", "en")\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("sentiment")\
    .setThreshold(1.0)

nlpPipeline = Pipeline(
      stages = [
          documentAssembler,
          use,
          sentimentdl
      ])

sentimentdl_use_twitter download started this may take some time.
Approximate size to download 11.4 MB
[OK!]


In [None]:
pipelineModel = nlpPipeline.fit(empty_df)

df = spark.createDataFrame(pd.DataFrame({"text":text_list}))
result = pipelineModel.transform(df)

In [None]:
result.select(F.explode(F.arrays_zip(result.document.result, result.sentiment.result)).alias("cols")) \
.select(F.expr("cols['0']").alias("document"),
        F.expr("cols['1']").alias("sentiment")).show(truncate=150)

+-------------------------------------------------------------------------------------------------------------------------------------------+---------+
|                                                                                                                                   document|sentiment|
+-------------------------------------------------------------------------------------------------------------------------------------------+---------+
|                                         @Mbjthegreat i really dont want AT&amp;T phone service..they suck when it comes to having a signal| negative|
|                                holy crap. I take a nap for 4 hours and Pitchfork blows up my twitter dashboard. I wish I was at Coachella.| negative|
|                                @Susy412 he is working today  ive tried that still not working..... hmmmm!! im rubbish with computers haha!| negative|
|Brand New Canon EOS 50D 15MP DSLR Camera Canon 17-85mm IS Lens ...: Web Technology Thre

Changing the threshold value from the default value of **0.6** to **1.0** changed two of the predictions.



#### Parameter **getters**

In [None]:
sentimentdl.getClasses()

['positive', 'negative']

In [None]:
sentimentdl.getEngine()

'tensorflow'

In [None]:
sentimentdl.getThreshold()

1.0

In [None]:
sentimentdl.getThresholdLabel()

'neutral'

# **💻 SentimentDLApproach**

Using `SentimentDLApproach`, we will train a model. 


## **🔎 Parameters**

- `setBatchSize`: (int) number of samples used in one iteration of training  (Default: `64`).

- `setLr`: (float) controls the step size taken by the optimizer when updating the weights of the model (Default: `5e-3`)

- `setThreshold`: (float) the minimum confidence score for a prediction to be accepted (Default: `0.6`).

- `setThresholdLabel`: (str) the label assigned when the score is below the threshold (Default: `"neutral"`).

- `setMaxEpochs`: (int) the number of times the model is trained on the entire dataset (Default: `10`).

- `setValidationSplit`: (float) the proportion of the training dataset held out and validated against the model on each epoch (`Default`: 0.0).

- `setVerbose`: (int) controls how much information is printed during training (`Default`: 0).

- `setDropout`: (float) defines the dropout coefficient (`Default`: 0.5). 

- `setEnableOutputLogs`: (Boolean) sets whether to output to annotators log folder (`Default`: False). 

- `setOutputLogsPath`: (str) defines the folder path to save training logs (`Default`: ""). 

- `setEvaluationLogExtended`: (Boolean) sets whether logs for validation to be extended (`Default`: False). 

- `setLabelColumn`: (str) defines the column with label per each document.

- `setRandomSeed`: (int) sets the random seed for shuffling the dataset.

- `setTestDataset`: defines the path to a parquet file of a test dataset.



### Load the **Training** and **Testing Datasets**

In [None]:
!wget -q https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/sentiment-corpus/aclimdb/aclimdb_train.csv -O aclimdb_train.csv
!wget -q https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/sentiment-corpus/aclimdb/aclimdb_test.csv -O aclimdb_test.csv

In [None]:
trainDataset = spark.read \
      .option("header", True) \
      .csv("aclimdb_train.csv")

trainDataset.show(5, truncate=150)

+------------------------------------------------------------------------------------------------------------------------------------------------------+--------+
|                                                                                                                                                  text|   label|
+------------------------------------------------------------------------------------------------------------------------------------------------------+--------+
|This is an Excellent little movie! The acting is good and the music is fantastic!! Play it on a 5-1 sound system and enjoy! It will never win any a...|positive|
|The Sarah Silverman program is very similar to Sarah's own stand up; It's so over the top with prejudice that you can't possibly take it to heart. ...|positive|
|"Prom Night" is a title-only remake of the 1980 slasher flick that starred Jamie Lee Curtis and Leslie Nielsen. This movie takes place in an Oregon...|negative|
|So often a band will get to

In [None]:
trainDataset.groupBy('label').count().show()

+--------+-----+
|   label|count|
+--------+-----+
|positive|12500|
|negative|12500|
+--------+-----+



In [None]:
testDataset = spark.read \
      .option("header", True) \
      .csv("aclimdb_test.csv")

testDataset.show(5, truncate = 150)

+------------------------------------------------------------------------------------------------------------------------------------------------------+--------+
|                                                                                                                                                  text|   label|
+------------------------------------------------------------------------------------------------------------------------------------------------------+--------+
|The Second Woman is about the story of a mysterious man who lost his wife in an accident and now believes that someone wants to do him harm. A girl...|negative|
|In my opinion the directing, editing, lighting, and acting(minus Franco) were very good. I must admit, I was pleasantly surprised and impressed wit...|positive|
|I am listening to Istanbul, intent, my eyes closed: At first there is a gentle breeze And the leaves on the trees Softly sway; Out there, far away,...|positive|
|Before I speak my piece, I 

In [None]:
testDataset.groupBy('label').count().show()

+--------+-----+
|   label|count|
+--------+-----+
|positive|12500|
|negative|12500|
+--------+-----+



[Universal Sentence Encoder](https://nlp.johnsnowlabs.com/docs/en/transformers#universalsentenceencoder) embeddings are used as the input features for training the model.


In [None]:
%%time
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

useEmbeddings = UniversalSentenceEncoder.pretrained() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence_embeddings")

sentimentdl = SentimentDLApproach() \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("sentiment") \
    .setLabelColumn("label") \
    .setMaxEpochs(5) \
    .setEnableOutputLogs(True)

pipeline = Pipeline() \
    .setStages(
      [
        documentAssembler,
        useEmbeddings,
        sentimentdl
      ]
    )

tfhub_use download started this may take some time.
Approximate size to download 923.7 MB
[OK!]
CPU times: user 32.1 ms, sys: 8.89 ms, total: 41 ms
Wall time: 5 s


In [None]:
%%time
pipelineModel = pipeline.fit(trainDataset)

CPU times: user 253 ms, sys: 30.6 ms, total: 283 ms
Wall time: 43.3 s


In [None]:
!cd ~/annotator_logs && ls -l

total 4
-rw-r--r-- 1 root root 440 Jan 28 18:14 SentimentDLApproach_13209f895f6a.log


In [None]:
!cat ~/annotator_logs/SentimentDLApproach_13209f895f6a.log

Training started - epochs: 5 - learning_rate: 0.005 - batch_size: 64 - training_examples: 25000
Epoch 0/5 - 4.53s - loss: 186.62079 - acc: 0.83087337 - batches: 391
Epoch 1/5 - 4.24s - loss: 174.04716 - acc: 0.85565704 - batches: 391
Epoch 2/5 - 4.28s - loss: 171.54402 - acc: 0.86230767 - batches: 391
Epoch 3/5 - 4.25s - loss: 170.19382 - acc: 0.8666587 - batches: 391
Epoch 4/5 - 4.20s - loss: 169.10829 - acc: 0.87136215 - batches: 391
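
Since each run writes a log in the format shown above, the per-epoch metrics can be extracted programmatically when comparing runs. The following is a minimal sketch that assumes the line layout stays exactly as printed here; the helper name `parse_epoch_line` is hypothetical, and the sample lines are copied from the log above.

```python
import re

# Pattern for lines such as:
# "Epoch 0/5 - 4.53s - loss: 186.62079 - acc: 0.83087337 - batches: 391"
EPOCH_LINE = re.compile(
    r"Epoch (?P<epoch>\d+)/(?P<total>\d+) - (?P<seconds>[\d.]+)s"
    r" - loss: (?P<loss>[\d.]+) - acc: (?P<acc>[\d.]+) - batches: (?P<batches>\d+)"
)


def parse_epoch_line(line):
    """Return a dict of metrics for one epoch line, or None if the line does not match."""
    match = EPOCH_LINE.match(line.strip())
    if match is None:
        return None
    return {
        "epoch": int(match["epoch"]),
        "seconds": float(match["seconds"]),
        "loss": float(match["loss"]),
        "acc": float(match["acc"]),
        "batches": int(match["batches"]),
    }


log_text = """\
Training started - epochs: 5 - learning_rate: 0.005 - batch_size: 64 - training_examples: 25000
Epoch 0/5 - 4.53s - loss: 186.62079 - acc: 0.83087337 - batches: 391
Epoch 4/5 - 4.20s - loss: 169.10829 - acc: 0.87136215 - batches: 391
"""

# Keep only the lines that describe an epoch; the header line is skipped.
epochs = [m for m in map(parse_epoch_line, log_text.splitlines()) if m is not None]
for e in epochs:
    print(e["epoch"], e["acc"])
```

The accuracy reported in these logs is measured on the training data, so it is best read alongside an evaluation on the test set.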


In [None]:
result = pipelineModel.transform(testDataset)

In [None]:
result_df = result.select('text','label',"sentiment.result").toPandas()
result_df.head(5)

Unnamed: 0,text,label,result
0,The Second Woman is about the story of a myste...,negative,[negative]
1,"In my opinion the directing, editing, lighting...",positive,[positive]
2,"I am listening to Istanbul, intent, my eyes cl...",positive,[positive]
3,"Before I speak my piece, I would like to make ...",negative,[neutral]
4,ManBearPig is a pretty funny episode of South ...,positive,[positive]
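
The preview above shows individual predictions, but an overall score is easier to compare across runs. The sketch below computes plain accuracy in pure Python. The function name `accuracy` is hypothetical, and it assumes each `result` entry is a list holding a single label, as in the preview. The sample values are the five rows shown above, not the full test set.

```python
def accuracy(labels, predictions):
    """Fraction of rows whose single predicted label equals the gold label.

    `predictions` holds one list per row (e.g. ["negative"]); an empty list
    counts as a miss, and so does a "neutral" prediction on this dataset.
    """
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    if not labels:
        return 0.0
    hits = sum(
        1 for gold, pred in zip(labels, predictions) if pred and pred[0] == gold
    )
    return hits / len(labels)


# The five rows from the preview above.
labels = ["negative", "positive", "positive", "negative", "positive"]
predictions = [["negative"], ["positive"], ["positive"], ["neutral"], ["positive"]]

print(accuracy(labels, predictions))
```

With the pandas frame built in the cell above, the same function could be called as `accuracy(result_df["label"].tolist(), result_df["result"].tolist())`.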


In [None]:
sentimentdl.extractParamMap()

{Param(parent='SentimentDLApproach_c7f0ef4d699e', name='lazyAnnotator', doc='Whether this AnnotatorModel acts as lazy in RecursivePipelines'): False,
 Param(parent='SentimentDLApproach_c7f0ef4d699e', name='maxEpochs', doc='Maximum number of epochs to train'): 5,
 Param(parent='SentimentDLApproach_c7f0ef4d699e', name='lr', doc='Learning Rate'): 0.005,
 Param(parent='SentimentDLApproach_c7f0ef4d699e', name='batchSize', doc='Batch size'): 64,
 Param(parent='SentimentDLApproach_c7f0ef4d699e', name='dropout', doc='Dropout coefficient'): 0.5,
 Param(parent='SentimentDLApproach_c7f0ef4d699e', name='enableOutputLogs', doc='Whether to use stdout in addition to Spark logs.'): True,
 Param(parent='SentimentDLApproach_c7f0ef4d699e', name='evaluationLogExtended', doc='Whether logs for validation to be extended: it displays time and evaluation of each label. Default is False.'): False,
 Param(parent='SentimentDLApproach_c7f0ef4d699e', name='threshold', doc='The minimum threshold for the final result


### Influence of parameters on the trained model.






`setMaxEpochs`

Increasing the number of epochs can improve the accuracy of the model by allowing it to learn more from the training data. However, it can also lead to overfitting if the model is exposed to the training data for too long.

We will use `setLabelColumn` to define the label column of the dataset and use `setEnableOutputLogs` to get the logs of the training process. Those two parameters have no effect on accuracy.

In [None]:
sentimentdl = SentimentDLApproach() \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("sentiment") \
    .setLabelColumn("label") \
    .setMaxEpochs(8) \
    .setEnableOutputLogs(True)

pipeline = Pipeline() \
    .setStages(
      [
        documentAssembler,
        useEmbeddings,
        sentimentdl
      ]
    )

In [None]:
%%time
pipelineModel = pipeline.fit(trainDataset)

CPU times: user 325 ms, sys: 37.1 ms, total: 362 ms
Wall time: 1min 2s


In [None]:
!cd ~/annotator_logs && ls -l

total 8
-rw-r--r-- 1 root root 440 Jan 28 18:14 SentimentDLApproach_13209f895f6a.log
-rw-r--r-- 1 root root 642 Jan 28 18:27 SentimentDLApproach_77c5f5edb9dc.log


In [None]:
!cat ~/annotator_logs/SentimentDLApproach_77c5f5edb9dc.log

Training started - epochs: 8 - learning_rate: 0.005 - batch_size: 64 - training_examples: 25000
Epoch 0/8 - 4.19s - loss: 192.36795 - acc: 0.8332292 - batches: 391
Epoch 1/8 - 3.99s - loss: 187.78656 - acc: 0.85705924 - batches: 391
Epoch 2/8 - 3.97s - loss: 182.50114 - acc: 0.8627324 - batches: 391
Epoch 3/8 - 4.12s - loss: 177.98988 - acc: 0.8693029 - batches: 391
Epoch 4/8 - 3.99s - loss: 175.29518 - acc: 0.87409455 - batches: 391
Epoch 5/8 - 4.01s - loss: 173.35866 - acc: 0.8774199 - batches: 391
Epoch 6/8 - 4.06s - loss: 171.8097 - acc: 0.88166666 - batches: 391
Epoch 7/8 - 4.11s - loss: 170.92455 - acc: 0.8848317 - batches: 391


In [None]:
result = pipelineModel.transform(testDataset)

result_df = result.select('text','label',"sentiment.result").toPandas()
result_df.head(5)

Unnamed: 0,text,label,result
0,The Second Woman is about the story of a myste...,negative,[negative]
1,"In my opinion the directing, editing, lighting...",positive,[positive]
2,"I am listening to Istanbul, intent, my eyes cl...",positive,[positive]
3,"Before I speak my piece, I would like to make ...",negative,[negative]
4,ManBearPig is a pretty funny episode of South ...,positive,[positive]


Increasing the number of epochs from **5** to **8** raised the final training accuracy from **0.8713** to **0.8848**, while the training time grew from about 43 s to about 62 s.

`setBatchSize`

The batch size in deep learning is the number of samples used for one weight update during training. It can affect the training speed, memory usage, and generalization performance of the model.

A larger batch size can lead to faster training times, as the model can make use of GPU parallelism. However, a larger batch size also requires more memory to store the activations and gradients.

A smaller batch size produces noisier gradient estimates and more weight updates per epoch. The noise can sometimes help generalization, but each epoch takes longer because the weights are updated more often.
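
The link between batch size and the number of weight updates per epoch is simple arithmetic, sketched below. The helper `batches_per_epoch` is a hypothetical name; the 25,000-example figure is the size of the training set used in this notebook.

```python
import math


def batches_per_epoch(num_examples, batch_size):
    """Number of batches needed to see every example once (the last batch may be partial)."""
    return math.ceil(num_examples / batch_size)


training_examples = 25000
for batch_size in (64, 8):
    print(batch_size, batches_per_epoch(training_examples, batch_size))
```

These counts correspond to the `batches:` value reported in the training logs of this notebook.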

In [None]:
sentimentdl = SentimentDLApproach() \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("sentiment") \
    .setLabelColumn("label") \
    .setMaxEpochs(5) \
    .setEnableOutputLogs(True)\
    .setBatchSize(8)
    
pipeline = Pipeline() \
    .setStages(
      [
        documentAssembler,
        useEmbeddings,
        sentimentdl
      ]
    )

In [None]:
%%time
pipelineModel = pipeline.fit(trainDataset)

CPU times: user 611 ms, sys: 59.3 ms, total: 670 ms
Wall time: 1min 53s


In [None]:
!cd ~/annotator_logs && ls -l

total 16
-rw-r--r-- 1 root root 440 Jan 28 18:14 SentimentDLApproach_13209f895f6a.log
-rw-r--r-- 1 root root 642 Jan 28 18:27 SentimentDLApproach_77c5f5edb9dc.log
-rw-r--r-- 1 root root 435 Jan 28 18:34 SentimentDLApproach_c960022d8e24.log
-rw-r--r-- 1 root root 434 Jan 28 18:29 SentimentDLApproach_f9200d5669ea.log


In [None]:
!cat ~/annotator_logs/SentimentDLApproach_c960022d8e24.log

Training started - epochs: 5 - learning_rate: 0.005 - batch_size: 8 - training_examples: 25000
Epoch 0/5 - 16.89s - loss: 1483.6069 - acc: 0.83268 - batches: 3125
Epoch 1/5 - 16.43s - loss: 1420.7325 - acc: 0.85416 - batches: 3125
Epoch 2/5 - 16.71s - loss: 1400.2687 - acc: 0.86344 - batches: 3125
Epoch 3/5 - 16.59s - loss: 1384.9261 - acc: 0.87016 - batches: 3125
Epoch 4/5 - 16.46s - loss: 1368.1877 - acc: 0.87704 - batches: 3125


In [None]:
result = pipelineModel.transform(testDataset)

result_df = result.select('text','label',"sentiment.result").toPandas()
result_df.head(5)

Unnamed: 0,text,label,result
0,The Second Woman is about the story of a myste...,negative,[negative]
1,"In my opinion the directing, editing, lighting...",positive,[positive]
2,"I am listening to Istanbul, intent, my eyes cl...",positive,[positive]
3,"Before I speak my piece, I would like to make ...",negative,[negative]
4,ManBearPig is a pretty funny episode of South ...,positive,[positive]


Decreasing the batch size to **8** (the default is **64**) increased the training time from about 43 s to 1 min 53 s (roughly 2.6 times longer), but the final training accuracy rose only slightly (from **0.8713** to **0.8770**).

<br/>

`setLr`

The learning rate in deep learning controls the step size the optimizer takes when updating the model parameters.

If the learning rate is too low, the model takes a long time to converge; if it is too high, the optimizer may overshoot the optimal solution and never converge.

Finding a good learning rate is important, as it can greatly affect the performance of the model.
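
The effect of the step size can be seen on a toy problem. The sketch below runs plain gradient descent on `f(x) = x**2`; it is only an illustration of slow convergence versus overshooting, not the optimizer or loss used inside `SentimentDLApproach`, and the function name and learning rates are chosen for the example.

```python
def gradient_descent(lr, steps=20, start=1.0):
    """Minimize f(x) = x**2 from `start` using a fixed learning rate `lr`."""
    x = start
    for _ in range(steps):
        grad = 2 * x          # derivative of x**2
        x = x - lr * grad     # gradient descent update
    return x


# A small rate barely moves, a moderate rate converges, a large rate diverges.
for lr in (0.01, 0.4, 1.1):
    print(f"lr={lr}: x after 20 steps = {gradient_descent(lr):.4g}")
```

The minimum is at `x = 0`: the smallest rate is still far from it after 20 steps, the moderate rate reaches it almost exactly, and the largest rate moves further away with every update.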

<br/>

In [None]:
sentimentdl = SentimentDLApproach() \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("sentiment") \
    .setLabelColumn("label") \
    .setMaxEpochs(5) \
    .setEnableOutputLogs(True)\
    .setBatchSize(64)\
    .setLr(0.001)
    
pipeline = Pipeline() \
    .setStages(
      [
        documentAssembler,
        useEmbeddings,
        sentimentdl
      ]
    )


In [None]:
%%time
pipelineModel = pipeline.fit(trainDataset)

CPU times: user 277 ms, sys: 36.1 ms, total: 313 ms
Wall time: 51.2 s


In [None]:
!cd ~/annotator_logs && ls -l

total 20
-rw-r--r-- 1 root root 437 Jan 28 18:41 SentimentDLApproach_0db56013853a.log
-rw-r--r-- 1 root root 440 Jan 28 18:14 SentimentDLApproach_13209f895f6a.log
-rw-r--r-- 1 root root 642 Jan 28 18:27 SentimentDLApproach_77c5f5edb9dc.log
-rw-r--r-- 1 root root 435 Jan 28 18:34 SentimentDLApproach_c960022d8e24.log
-rw-r--r-- 1 root root 434 Jan 28 18:29 SentimentDLApproach_f9200d5669ea.log


In [None]:
!cat ~/annotator_logs/SentimentDLApproach_0db56013853a.log

Training started - epochs: 5 - learning_rate: 0.001 - batch_size: 64 - training_examples: 25000
Epoch 0/5 - 4.21s - loss: 191.12515 - acc: 0.83260417 - batches: 391
Epoch 1/5 - 4.25s - loss: 185.34619 - acc: 0.8533574 - batches: 391
Epoch 2/5 - 4.15s - loss: 184.97502 - acc: 0.8587661 - batches: 391
Epoch 3/5 - 4.02s - loss: 184.43819 - acc: 0.8611699 - batches: 391
Epoch 4/5 - 4.22s - loss: 183.95926 - acc: 0.8627324 - batches: 391


In [None]:
result = pipelineModel.transform(testDataset)

result_df = result.select('text','label',"sentiment.result").toPandas()
result_df.head(5)

Unnamed: 0,text,label,result
0,The Second Woman is about the story of a myste...,negative,[negative]
1,"In my opinion the directing, editing, lighting...",positive,[positive]
2,"I am listening to Istanbul, intent, my eyes cl...",positive,[positive]
3,"Before I speak my piece, I would like to make ...",negative,[positive]
4,ManBearPig is a pretty funny episode of South ...,positive,[positive]


Decreasing the learning rate from **0.005** to **0.001** lowered the final training accuracy (from **0.8713** to **0.8627**), and the fourth sample review is now misclassified as positive.

<br/>

`setDropout`

The dropout technique is a regularization method used in deep learning to reduce overfitting. 

It works by randomly dropping out neurons from the neural network during training, which forces the network to learn more robust features that are useful in making predictions on new data.
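
The mechanism can be sketched in a few lines of plain Python. This example uses the common "inverted dropout" variant, which rescales the surviving values so their expected sum is unchanged; that choice is an assumption for illustration, since the notebook does not show how dropout is implemented inside `SentimentDLApproach`. The function name `dropout` and the sample activations are hypothetical.

```python
import random


def dropout(values, rate, rng):
    """Inverted dropout: zero each value with probability `rate`, scale survivors by 1/(1-rate)."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    keep_prob = 1.0 - rate
    return [v / keep_prob if rng.random() < keep_prob else 0.0 for v in values]


rng = random.Random(0)  # fixed seed so the example is repeatable
activations = [1.0, 2.0, 3.0, 4.0]
dropped = dropout(activations, rate=0.5, rng=rng)
print(dropped)
```

Dropout is applied only during training; at prediction time all neurons are kept.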

In [None]:
sentimentdl = SentimentDLApproach() \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("sentiment") \
    .setLabelColumn("label") \
    .setMaxEpochs(5) \
    .setEnableOutputLogs(True)\
    .setBatchSize(64)\
    .setLr(0.005)\
    .setDropout(0.2)
    
pipeline = Pipeline() \
    .setStages(
      [
        documentAssembler,
        useEmbeddings,
        sentimentdl
      ]
    )

In [None]:
%%time
pipelineModel = pipeline.fit(trainDataset)

CPU times: user 285 ms, sys: 36 ms, total: 321 ms
Wall time: 53 s


In [None]:
!cd ~/annotator_logs && ls -l

total 24
-rw-r--r-- 1 root root 437 Jan 28 18:41 SentimentDLApproach_0db56013853a.log
-rw-r--r-- 1 root root 440 Jan 28 18:14 SentimentDLApproach_13209f895f6a.log
-rw-r--r-- 1 root root 437 Jan 28 18:54 SentimentDLApproach_3c9236ac88d5.log
-rw-r--r-- 1 root root 642 Jan 28 18:27 SentimentDLApproach_77c5f5edb9dc.log
-rw-r--r-- 1 root root 435 Jan 28 18:34 SentimentDLApproach_c960022d8e24.log
-rw-r--r-- 1 root root 434 Jan 28 18:29 SentimentDLApproach_f9200d5669ea.log


In [None]:
!cat ~/annotator_logs/SentimentDLApproach_3c9236ac88d5.log

Training started - epochs: 5 - learning_rate: 0.005 - batch_size: 64 - training_examples: 25000
Epoch 0/5 - 4.47s - loss: 179.51999 - acc: 0.8339423 - batches: 391
Epoch 1/5 - 4.47s - loss: 173.24928 - acc: 0.85473555 - batches: 391
Epoch 2/5 - 4.31s - loss: 170.82431 - acc: 0.86222756 - batches: 391
Epoch 3/5 - 4.47s - loss: 168.52895 - acc: 0.8676202 - batches: 391
Epoch 4/5 - 4.28s - loss: 166.0777 - acc: 0.8726683 - batches: 391


In [None]:
result = pipelineModel.transform(testDataset)

result_df = result.select('text','label',"sentiment.result").toPandas()
result_df.head(5)

Unnamed: 0,text,label,result
0,The Second Woman is about the story of a myste...,negative,[negative]
1,"In my opinion the directing, editing, lighting...",positive,[positive]
2,"I am listening to Istanbul, intent, my eyes cl...",positive,[positive]
3,"Before I speak my piece, I would like to make ...",negative,[positive]
4,ManBearPig is a pretty funny episode of South ...,positive,[positive]


Decreasing the dropout from **0.5** to **0.2** caused a slight increase in the final training accuracy (from **0.8713** to **0.8727**).

<br/>

`setThreshold`

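In `SentimentDL`, the threshold is the minimum confidence score a prediction must reach to be reported. When the score of the predicted class falls below it, the annotator outputs the threshold label ("neutral" by default) instead. Lowering the threshold, as done here with 0.2, means that fewer predictions are replaced with the threshold label.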

<br/>


In [None]:
sentimentdl = SentimentDLApproach() \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("sentiment") \
    .setLabelColumn("label") \
    .setMaxEpochs(5) \
    .setEnableOutputLogs(True)\
    .setBatchSize(64)\
    .setLr(0.005)\
    .setDropout(0.5)\
    .setThreshold(0.2)
    
pipeline = Pipeline() \
    .setStages(
      [
        documentAssembler,
        useEmbeddings,
        sentimentdl
      ]
    )

In [None]:
%%time
pipelineModel = pipeline.fit(trainDataset)

CPU times: user 278 ms, sys: 46.7 ms, total: 325 ms
Wall time: 52 s


In [None]:
!cd ~/annotator_logs && ls -l

total 28
-rw-r--r-- 1 root root 437 Jan 28 18:41 SentimentDLApproach_0db56013853a.log
-rw-r--r-- 1 root root 440 Jan 28 18:14 SentimentDLApproach_13209f895f6a.log
-rw-r--r-- 1 root root 437 Jan 28 18:54 SentimentDLApproach_3c9236ac88d5.log
-rw-r--r-- 1 root root 642 Jan 28 18:27 SentimentDLApproach_77c5f5edb9dc.log
-rw-r--r-- 1 root root 435 Jan 28 18:34 SentimentDLApproach_c960022d8e24.log
-rw-r--r-- 1 root root 436 Jan 28 19:26 SentimentDLApproach_e3e1e8ada7bd.log
-rw-r--r-- 1 root root 434 Jan 28 18:29 SentimentDLApproach_f9200d5669ea.log


In [None]:
!cat ~/annotator_logs/SentimentDLApproach_e3e1e8ada7bd.log

Training started - epochs: 5 - learning_rate: 0.005 - batch_size: 64 - training_examples: 25000
Epoch 0/5 - 4.40s - loss: 183.00195 - acc: 0.8320993 - batches: 391
Epoch 1/5 - 4.15s - loss: 180.9149 - acc: 0.85615385 - batches: 391
Epoch 2/5 - 4.24s - loss: 177.15211 - acc: 0.8619872 - batches: 391
Epoch 3/5 - 4.15s - loss: 174.10666 - acc: 0.86615384 - batches: 391
Epoch 4/5 - 4.18s - loss: 172.1392 - acc: 0.8695593 - batches: 391


In [None]:
result = pipelineModel.transform(testDataset)

result_df = result.select('text','label',"sentiment.result").toPandas()
result_df.head(5)

Unnamed: 0,text,label,result
0,The Second Woman is about the story of a myste...,negative,[negative]
1,"In my opinion the directing, editing, lighting...",positive,[positive]
2,"I am listening to Istanbul, intent, my eyes cl...",positive,[positive]
3,"Before I speak my piece, I would like to make ...",negative,[positive]
4,ManBearPig is a pretty funny episode of South ...,positive,[positive]


Setting the `Threshold` far from its default value resulted in poorer training performance.
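
Comparing runs by eye is error-prone, so a small helper can pull the per-epoch numbers out of a training log. The sketch below assumes the log line format shown in the output above (`Epoch N/M - Xs - loss: ... - acc: ... - batches: ...`). `parse_epochs` is a hypothetical helper, not part of Spark NLP.

```python
import re

# Matches lines such as:
# "Epoch 4/5 - 4.18s - loss: 172.1392 - acc: 0.8695593 - batches: 391"
EPOCH_RE = re.compile(
    r"Epoch (\d+)/(\d+) - ([\d.]+)s - loss: ([\d.]+) - acc: ([\d.]+) - batches: (\d+)"
)


def parse_epochs(log_text):
    """Return a list of dicts with epoch, loss, and acc for every epoch line."""
    rows = []
    for match in EPOCH_RE.finditer(log_text):
        epoch, _total, _secs, loss, acc, _batches = match.groups()
        rows.append({"epoch": int(epoch), "loss": float(loss), "acc": float(acc)})
    return rows


sample_log = """Training started - epochs: 5 - learning_rate: 0.005 - batch_size: 64 - training_examples: 25000
Epoch 0/5 - 4.40s - loss: 183.00195 - acc: 0.8320993 - batches: 391
Epoch 4/5 - 4.18s - loss: 172.1392 - acc: 0.8695593 - batches: 391
"""

epochs = parse_epochs(sample_log)
print(epochs[-1]["acc"])  # 0.8695593
```

In the notebook, the text of a log file could be read with `open(path).read()` and passed to `parse_epochs` to compare the final accuracy of two runs.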

<br/>

## **💻 Retrain the Model with Different Parameters - 1**

Change the major training parameters together and observe their cumulative effect on the model's performance.


- `setMaxEpochs`: 10
- `setLr`: 5e-3
- `setThreshold`: 0.5
- `setDropout`: 0.5
- `setBatchSize`: 16
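
Batch size also determines how many batches are processed per epoch, which is the `batches` value reported in the training logs. The quick check below assumes the 25,000 training examples reported in the logs and that the last partial batch is counted. `batches_per_epoch` is a hypothetical helper used only for this illustration.

```python
import math


def batches_per_epoch(num_examples, batch_size):
    """Number of batches needed to see every example once (last batch may be partial)."""
    return math.ceil(num_examples / batch_size)


print(batches_per_epoch(25000, 64))  # 391
print(batches_per_epoch(25000, 16))  # 1563
```

Dropping the batch size from 64 to 16 therefore means roughly four times as many batches per epoch.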



In [None]:
%%time
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

useEmbeddings = UniversalSentenceEncoder.pretrained() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence_embeddings")

sentimentdl = SentimentDLApproach() \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("sentiment") \
    .setLabelColumn("label") \
    .setBatchSize(16) \
    .setMaxEpochs(10) \
    .setLr(5e-3) \
    .setThreshold(0.5) \
    .setDropout(0.5) \
    .setEnableOutputLogs(True)

pipeline = Pipeline() \
    .setStages(
      [
        documentAssembler,
        useEmbeddings,
        sentimentdl
      ]
    )

tfhub_use download started this may take some time.
Approximate size to download 923.7 MB
[OK!]
CPU times: user 36.5 ms, sys: 5.95 ms, total: 42.4 ms
Wall time: 6.04 s


In [None]:
%%time
pipelineModel = pipeline.fit(trainDataset)

CPU times: user 476 ms, sys: 54.1 ms, total: 530 ms
Wall time: 1min 26s


In [None]:
!cd ~/annotator_logs && ls -l

In [None]:
!cat ~/annotator_logs/SentimentDLApproach_56b32bc4688a.log

Training started - epochs: 10 - learning_rate: 0.005 - batch_size: 16 - training_examples: 25000
Epoch 0/10 - 6.15s - loss: 760.86707 - acc: 0.83134604 - batches: 1563
Epoch 1/10 - 5.84s - loss: 725.3715 - acc: 0.8524728 - batches: 1563
Epoch 2/10 - 5.58s - loss: 711.9842 - acc: 0.86079544 - batches: 1563
Epoch 3/10 - 5.61s - loss: 699.32196 - acc: 0.8676776 - batches: 1563
Epoch 4/10 - 6.42s - loss: 687.68933 - acc: 0.8738396 - batches: 1563
Epoch 5/10 - 5.86s - loss: 681.52374 - acc: 0.8808819 - batches: 1563
Epoch 6/10 - 6.39s - loss: 672.99677 - acc: 0.8856434 - batches: 1563
Epoch 7/10 - 5.85s - loss: 665.8134 - acc: 0.8907651 - batches: 1563
Epoch 8/10 - 5.57s - loss: 656.8764 - acc: 0.89580667 - batches: 1563
Epoch 9/10 - 5.89s - loss: 649.78217 - acc: 0.900008 - batches: 1563


In [None]:
result = pipelineModel.transform(testDataset)

In [None]:
result_df = result.select('text','label',"sentiment.result").toPandas()
result_df.head(20)

Unnamed: 0,text,label,result
0,The Second Woman is about the story of a myste...,negative,[negative]
1,"In my opinion the directing, editing, lighting...",positive,[positive]
2,"I am listening to Istanbul, intent, my eyes cl...",positive,[positive]
3,"Before I speak my piece, I would like to make ...",negative,[positive]
4,ManBearPig is a pretty funny episode of South ...,positive,[positive]
5,A buddy and I went to see this movie when it c...,negative,[negative]
6,It is incredible that there were two films wit...,negative,[negative]
7,Dire! Dismal! Awful! Laughable! Disappointing!...,negative,[negative]
8,"HLOTS was an outstanding series, its what NYPD...",positive,[positive]
9,This is just one of those films which cannot j...,negative,[negative]


Training with different parameters produced different results in rows 3, 11, and 13.
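
Rather than scanning the table by eye, the mismatched rows can be listed programmatically. The sketch below is plain Python and assumes each prediction is a one-element sequence such as `['positive']`, as in the `result` column above. `mismatched_rows` is a hypothetical helper, and the sample data is made up for illustration.

```python
def mismatched_rows(labels, results):
    """Return the indices where the predicted label differs from the true label.

    `results` holds one sequence per row, e.g. ['positive']; an empty one counts as a miss.
    """
    wrong = []
    for idx, (label, result) in enumerate(zip(labels, results)):
        predicted = result[0] if len(result) > 0 else None
        if predicted != label:
            wrong.append(idx)
    return wrong


# Made-up sample in the same shape as result_df
labels = ["negative", "positive", "positive", "negative"]
results = [["negative"], ["positive"], ["positive"], ["positive"]]

wrong = mismatched_rows(labels, results)
print(wrong)  # [3]
print(1 - len(wrong) / len(labels))  # 0.75
```

With the notebook's DataFrame, this could be called as `mismatched_rows(result_df['label'].tolist(), result_df['result'].tolist())`.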

## **💻 Retrain the Model with Different Parameters - 2**


- `setMaxEpochs`: 20
- `setLr`: 5e-3
- `setThreshold`: 0.5
- `setDropout`: 0.5
- `setBatchSize`: 16

In [None]:
sentimentdl = SentimentDLApproach() \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("sentiment") \
    .setLabelColumn("label") \
    .setBatchSize(16) \
    .setMaxEpochs(20) \
    .setLr(5e-3) \
    .setThreshold(0.5) \
    .setDropout(0.5) \
    .setEnableOutputLogs(True)

pipeline = Pipeline() \
    .setStages(
      [
        documentAssembler,
        useEmbeddings,
        sentimentdl
      ]
    )

In [None]:
%%time
pipelineModel = pipeline.fit(trainDataset)

CPU times: user 1.37 s, sys: 152 ms, total: 1.52 s
Wall time: 3min 54s


In [None]:
!cd ~/annotator_logs && ls -l

total 4
-rw-r--r-- 1 root root 1497 Jan  9 16:53 SentimentDLApproach_1d3cc6c71f23.log


In [None]:
!cat ~/annotator_logs/SentimentDLApproach_1d3cc6c71f23.log

Training started - epochs: 20 - learning_rate: 0.005 - batch_size: 16 - training_examples: 25000
Epoch 0/20 - 9.94s - loss: 729.8246 - acc: 0.8346671 - batches: 1563
Epoch 1/20 - 9.37s - loss: 701.936 - acc: 0.8565541 - batches: 1563
Epoch 2/20 - 9.15s - loss: 691.77997 - acc: 0.8628761 - batches: 1563
Epoch 3/20 - 9.21s - loss: 683.1144 - acc: 0.8693582 - batches: 1563
Epoch 4/20 - 9.17s - loss: 674.8982 - acc: 0.8768806 - batches: 1563
Epoch 5/20 - 9.32s - loss: 667.14655 - acc: 0.88308257 - batches: 1563
Epoch 6/20 - 9.18s - loss: 659.563 - acc: 0.8894046 - batches: 1563
Epoch 7/20 - 9.17s - loss: 653.2873 - acc: 0.893886 - batches: 1563
Epoch 8/20 - 9.16s - loss: 645.3135 - acc: 0.89788735 - batches: 1563
Epoch 9/20 - 9.12s - loss: 639.85455 - acc: 0.9024488 - batches: 1563
Epoch 10/20 - 9.14s - loss: 635.6452 - acc: 0.9040493 - batches: 1563
Epoch 11/20 - 9.17s - loss: 631.8371 - acc: 0.9068502 - batches: 1563
Epoch 12/20 - 9.15s - loss: 628.44604 - acc: 0.90865076 - batches: 1563

In [None]:
result = pipelineModel.transform(testDataset)

In [None]:
result_df = result.select('text','label',"sentiment.result").toPandas()
result_df.head(20)

Unnamed: 0,text,label,result
0,The Second Woman is about the story of a myste...,negative,[negative]
1,"In my opinion the directing, editing, lighting...",positive,[positive]
2,"I am listening to Istanbul, intent, my eyes cl...",positive,[positive]
3,"Before I speak my piece, I would like to make ...",negative,[positive]
4,ManBearPig is a pretty funny episode of South ...,positive,[positive]
5,A buddy and I went to see this movie when it c...,negative,[negative]
6,It is incredible that there were two films wit...,negative,[negative]
7,Dire! Dismal! Awful! Laughable! Disappointing!...,negative,[negative]
8,"HLOTS was an outstanding series, its what NYPD...",positive,[positive]
9,This is just one of those films which cannot j...,negative,[negative]


The final training run reduced the number of wrong predictions to just one, in row 13.

### 💾 **Saving & Loading Back the Trained Model**

In [None]:
pipelineModel.stages

[DocumentAssembler_eb98866d1189,
 UNIVERSAL_SENTENCE_ENCODER_4de71669b7ec,
 SentimentDLModel_f8eda8b46edb]

In [None]:
# Save the Model
pipelineModel.stages[2].write().overwrite().save('SentimentDL_use')

In [None]:
# Load back the saved Model
SentimentClfModel = SentimentDLModel.load('SentimentDL_use')

In [None]:
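# Build a prediction pipeline around the loaded model.
# Every stage is already trained, so fitting on an empty DataFrame only assembles the PipelineModel.
ld_pipeline = Pipeline(stages=[documentAssembler, useEmbeddings, SentimentClfModel])
ld_pipeline_model = ld_pipeline.fit(spark.createDataFrame([['']]).toDF("text"))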

In [None]:
# Apply Model Transform to testData
ld_preds = ld_pipeline_model.transform(testDataset)

In [None]:
ld_preds_df = ld_preds.select('text','label',"sentiment.result").toPandas()

In [None]:
ld_preds_df.head(20)

Unnamed: 0,text,label,result
0,The Second Woman is about the story of a myste...,negative,[negative]
1,"In my opinion the directing, editing, lighting...",positive,[positive]
2,"I am listening to Istanbul, intent, my eyes cl...",positive,[positive]
3,"Before I speak my piece, I would like to make ...",negative,[positive]
4,ManBearPig is a pretty funny episode of South ...,positive,[positive]
5,A buddy and I went to see this movie when it c...,negative,[negative]
6,It is incredible that there were two films wit...,negative,[negative]
7,Dire! Dismal! Awful! Laughable! Disappointing!...,negative,[negative]
8,"HLOTS was an outstanding series, its what NYPD...",positive,[positive]
9,This is just one of those films which cannot j...,negative,[negative]
