
## Vivekn Sentiment Analysis

In the following example, we walk through sentiment analysis training and prediction using Spark NLP annotators.

The ViveknSentimentApproach annotator implements the [Vivek Narayanan algorithm](https://arxiv.org/pdf/1305.6143.pdf). It can be trained either from a column in the training dataset whose rows are labelled 'positive' or 'negative', or from one folder of positive texts and one folder of negative texts. Using n-grams and negation of sequences, this statistical model can achieve high accuracy if trained properly.
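
To make "n-grams and negation of sequences" more concrete, here is a minimal pure-Python sketch of the idea. It is not the Spark NLP implementation: the names `negate_tokens`, `features`, `train` and `predict`, the negation word list, and the presence-based naive Bayes scoring with add-one smoothing are all simplifying assumptions made for illustration.

```python
import math
import re
from collections import Counter

PUNCTUATION = {".", ",", "!", "?", ";"}
NEGATIONS = {"not", "no", "never"}  # assumed, deliberately tiny list


def negate_tokens(text):
    """Prefix words that follow a negation with 'not_' until the next punctuation mark."""
    tokens = re.findall(r"[a-z']+|[.,!?;]", text.lower())
    out, negated = [], False
    for tok in tokens:
        if tok in PUNCTUATION:
            negated = False          # punctuation ends the negated span
        elif tok in NEGATIONS or tok.endswith("n't"):
            negated = True           # the negation word itself is dropped
        else:
            out.append("not_" + tok if negated else tok)
    return out


def features(text):
    """Set of unigrams and bigrams (each feature counted once per text)."""
    toks = negate_tokens(text)
    bigrams = [a + " " + b for a, b in zip(toks, toks[1:])]
    return set(toks + bigrams)


def train(labelled_texts):
    """labelled_texts: iterable of (text, 'positive' | 'negative') pairs, both labels present."""
    doc_counts = Counter()
    feat_counts = {"positive": Counter(), "negative": Counter()}
    for text, label in labelled_texts:
        doc_counts[label] += 1
        feat_counts[label].update(features(text))
    return doc_counts, feat_counts


def predict(model, text):
    """Return the label with the highest smoothed log-probability."""
    doc_counts, feat_counts = model
    vocab = set(feat_counts["positive"]) | set(feat_counts["negative"])
    total_docs = sum(doc_counts.values())
    scores = {}
    for label, counts in feat_counts.items():
        score = math.log(doc_counts[label] / total_docs)
        total = sum(counts.values())
        for feat in features(text):
            if feat in vocab:        # unseen features are ignored
                score += math.log((counts[feat] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)
```

With this sketch, "not great" becomes the single feature `not_great`, so it can be learned as a negative signal even when "great" alone is positive.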

Spark can be leveraged during training by using the ReadAs.Dataset setting. Spark is used during prediction by default.

We also include a spell checker in this pipeline, which corrects our sentences for better sentiment analysis accuracy.

#### 1. Call necessary imports and set the resource path to read local data files

In [95]:
#Imports
import time
import sys
import os
#sys.path.append('../../')

from pyspark.sql import SparkSession  # required by the SparkSession.builder call in step 2
from pyspark.ml import Pipeline, PipelineModel
from sparknlp.annotator import *
from sparknlp.base import DocumentAssembler, Finisher

# Location of the resource directory
resource_path = "../../../src/test/resources/"


#### 2. Load SparkSession if not already there

In [96]:
spark = SparkSession.builder \
    .appName("VivekNarayanSentimentApproach")\
    .master("local[*]")\
    .config("spark.driver.memory","8G")\
    .config("spark.driver.maxResultSize", "2G")\
    .config("spark.jars", "lib/sparknlp.jar")\
    .config("spark.kryoserializer.buffer.max", "500m")\
    .getOrCreate()

#### 3. Load a Spark dataset and put it in memory

In [97]:
#Load the input data to be annotated
data = spark.read \
    .parquet(resource_path + "sentiment.parquet") \
    .limit(1000) \
    .cache()
data.show()

+------+---------+--------------------+
|itemid|sentiment|                text|
+------+---------+--------------------+
|     1|        0|                 ...|
|     2|        0|                 ...|
|     3|        1|              omg...|
|     4|        0|          .. Omga...|
|     5|        0|         i think ...|
|     6|        0|         or i jus...|
|     7|        1|       Juuuuuuuuu...|
|     8|        0|       Sunny Agai...|
|     9|        1|      handed in m...|
|    10|        1|      hmmmm.... i...|
|    11|        0|      I must thin...|
|    12|        1|      thanks to a...|
|    13|        0|      this weeken...|
|    14|        0|     jb isnt show...|
|    15|        0|     ok thats it ...|
|    16|        0|    &lt;-------- ...|
|    17|        0|    awhhe man.......|
|    18|        1|    Feeling stran...|
|    19|        0|    HUGE roll of ...|
|    20|        0|    I just cut my...|
+------+---------+--------------------+
only showing top 20 rows



#### 4. Create the document assembler, which will put the target text column into Annotation form

In [98]:
# Define the document assembler
document_assembler = DocumentAssembler() \
            .setInputCol("text")\
            .setOutputCol("document")


In [99]:
# Example: check out the output of the document assembler
assembled = document_assembler.transform(data)
assembled.show(5)

+------+---------+--------------------+--------------------+
|itemid|sentiment|                text|            document|
+------+---------+--------------------+--------------------+
|     1|        0|                 ...|[[document, 0, 39...|
|     2|        0|                 ...|[[document, 0, 31...|
|     3|        1|              omg...|[[document, 0, 22...|
|     4|        0|          .. Omga...|[[document, 0, 12...|
|     5|        0|         i think ...|[[document, 0, 37...|
+------+---------+--------------------+--------------------+
only showing top 5 rows



#### 5. Create the sentence detector to split every document into sentences

In [100]:
### Sentence detector
sentence_detector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

In [101]:
# Example: check out the output of the sentence detector
sentence_data = sentence_detector.transform(assembled)
sentence_data.show(5)

+------+---------+--------------------+--------------------+--------------------+
|itemid|sentiment|                text|            document|            sentence|
+------+---------+--------------------+--------------------+--------------------+
|     1|        0|                 ...|[[document, 0, 39...|[[document, 0, 27...|
|     2|        0|                 ...|[[document, 0, 31...|[[document, 0, 29...|
|     3|        1|              omg...|[[document, 0, 22...|[[document, 0, 22...|
|     4|        0|          .. Omga...|[[document, 0, 12...|[[document, 0, 0,...|
|     5|        0|         i think ...|[[document, 0, 37...|[[document, 0, 33...|
+------+---------+--------------------+--------------------+--------------------+
only showing top 5 rows



#### 6. The tokenizer will match standard tokens

In [102]:
### Tokenizer
tokenizer = Tokenizer() \
            .setInputCols(["sentence"]) \
            .setOutputCol("token")


In [103]:
# Example: check out the output of the tokenizer
tokenized = tokenizer.transform(sentence_data)
tokenized.show(5)

+------+---------+--------------------+--------------------+--------------------+--------------------+
|itemid|sentiment|                text|            document|            sentence|               token|
+------+---------+--------------------+--------------------+--------------------+--------------------+
|     1|        0|                 ...|[[document, 0, 39...|[[document, 0, 27...|[[token, 0, 1, is...|
|     2|        0|                 ...|[[document, 0, 31...|[[document, 0, 29...|[[token, 0, 0, I,...|
|     3|        1|              omg...|[[document, 0, 22...|[[document, 0, 22...|[[token, 0, 2, om...|
|     4|        0|          .. Omga...|[[document, 0, 12...|[[document, 0, 0,...|[[token, 0, 0, .,...|
|     5|        0|         i think ...|[[document, 0, 37...|[[document, 0, 33...|[[token, 0, 0, i,...|
+------+---------+--------------------+--------------------+--------------------+--------------------+
only showing top 5 rows
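
The truncated rows above suggest that each annotation carries a type, a begin offset, an end offset and the matched text (for example `[token, 0, 1, is...`). The following pure-Python sketch mimics that shape. It assumes inclusive end offsets, as the `is` example implies, and uses a simple regular expression that is not the rule set of the Spark NLP tokenizer; `Annotation` and `simple_tokens` are hypothetical names.

```python
import re
from typing import List, NamedTuple


class Annotation(NamedTuple):
    annotator_type: str
    begin: int   # index of the first character
    end: int     # index of the last character (inclusive)
    result: str


def simple_tokens(text: str) -> List[Annotation]:
    """Split text into word and punctuation tokens with character offsets."""
    return [
        Annotation("token", m.start(), m.end() - 1, m.group())
        for m in re.finditer(r"\w+|[^\w\s]", text)
    ]
```

For instance, `simple_tokens("is it ok?")` yields four annotations, the first of which is `Annotation("token", 0, 1, "is")`.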



#### 7. The normalizer will clean up the tokens

In [104]:
normalizer = Normalizer() \
            .setInputCols(["token"]) \
            .setOutputCol("normal")
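
As a rough illustration of what "cleaning" tokens can mean, the sketch below strips every character that is not an ASCII letter and drops tokens that end up empty. The pattern and the function name `normalize` are assumptions for illustration; the actual Normalizer annotator is configurable and may behave differently.

```python
import re


def normalize(tokens, pattern=r"[^A-Za-z]"):
    """Remove characters matching `pattern` from each token and drop empty results."""
    cleaned = (re.sub(pattern, "", tok) for tok in tokens)
    return [tok for tok in cleaned if tok]
```

Applied to `["sooo", "CRy.", "11..", "I've"]`, this keeps `sooo`, turns `CRy.` into `CRy` and `I've` into `Ive`, and discards `11..` entirely.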

#### 8. The spell checker will correct normalized tokens; it trains with a dictionary of English words

In [105]:
### Spell Checker
spell_checker = NorvigSweetingApproach() \
            .setInputCols(["normal"]) \
            .setOutputCol("spell") \
            .setDictionary( resource_path+ "spell/words.txt")
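
For intuition about dictionary-based spelling correction, here is a minimal sketch in the style of Peter Norvig's well-known spelling corrector: generate every string one edit away from the input and keep the most frequent dictionary word. It is a simplified illustration, not the NorvigSweetingApproach implementation; `edits1`, `correct` and the word-frequency mapping are hypothetical.

```python
LETTERS = "abcdefghijklmnopqrstuvwxyz"


def edits1(word):
    """All strings that are one delete, transpose, replace or insert away from `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right for c in LETTERS]
    inserts = [left + c + right for left, right in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)


def correct(word, freq):
    """Return `word` if known, else the most frequent known word one edit away, else `word`."""
    if word in freq:
        return word
    candidates = [w for w in edits1(word) if w in freq]
    return max(candidates, key=freq.get) if candidates else word
```

Here `freq` is any mapping from known words to counts, for example one built from a word list such as the dictionary file used above.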


#### 9. Create the ViveknSentimentApproach and set resources to train it

In [106]:
sentiment_detector = ViveknSentimentApproach() \
    .setInputCols(["spell", "sentence"]) \
    .setOutputCol("sentiment") \
    .setPruneCorpus(0) \
    .setPositiveSource(resource_path+"vivekn/positive") \
    .setNegativeSource(resource_path+"vivekn/negative")


#### 10. The finisher will convert the sentiment annotations into readable output

In [107]:
finisher = Finisher() \
    .setInputCols(["sentiment"]) \
    .setIncludeMetadata(False)


#### 11. Fit and predict over data

In [117]:
pipeline = Pipeline(stages=[
    document_assembler,
    sentence_detector,
    tokenizer,
    normalizer,
    spell_checker,
    sentiment_detector,
    finisher
])

start = time.time()
sentiment_data = pipeline.fit(data).transform(data)

end = time.time()
print("Time elapsed pipeline process: " + str(end - start))

Time elapsed pipeline process: 4.109271049499512
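
The start/end timing pattern recurs in several cells of this notebook. One possible way to factor it out is a small context manager; `timed` and its `sink` parameter are hypothetical helpers, not part of Spark or Spark NLP.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, sink=print):
    """Report the wall-clock time spent inside the `with` block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        sink(f"Time elapsed {label}: {elapsed:.3f}s")
```

Usage would look like `with timed("pipeline process"): ...`. Keep in mind that Spark evaluates DataFrame transformations lazily, so a timing that wraps only `transform` without an action such as `show()` or `count()` mostly measures the work done in `fit`.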


#### 12. Check the result

In [109]:
# Negative Sentiments
for r in sentiment_data.filter(sentiment_data.finished_sentiment.contains('negative')).take(5):
    print(r['text'].strip(),"->",r['finished_sentiment'])

.. Omgaga. Im sooo  im gunna CRy. I've been at this dentist since 11.. I was suposed 2 just get a crown put on (30mins)... -> na@na@na@negative
&lt;---Sad level is 3. I was writing a massive blog tweet on Myspace and my comp shut down. Now it's all lost *lays in fetal position* -> na@na@negative@na
SOX!     Floyd was great, but relievers need a scolding! -> na@negative
i was too slow to get $1 Up tix -> negative


In [110]:
# Positive Sentiments
for r in sentiment_data.filter(sentiment_data.finished_sentiment.contains('positive')).take(5):
    print(r['text'].strip(),"->",r['finished_sentiment'])

can't be bothered. i wish i could spend the rest of my life just sat here and going to gigs. seriously. -> na@positive@na
Feeeling like shit right now. I really want to sleep, but nooo I have 3 hours of dancing and an art assignment to finish. -> na@positive
i miss you guys too     i think i'm wearing skinny jeans a cute sweater and heels   not really sure   what are you doing today -> positive
(: !!!!!! - so i wrote something last week. and i got a call from someone in the new york office... http://tumblr.com/xcn21w6o7 -> na@positive@na
and the entertainment is over, someone complained properly..   @rupturerapture experimental you say? he should experiment with a melody -> positive@na@na
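
The `finished_sentiment` strings above hold one label per sentence, joined with `@`. If a single label per text is needed, one option is a majority vote over the sentence labels. This is only a sketch: treating `na` as "no label" and returning a `mixed` label on ties are assumptions, and `document_sentiment` is a hypothetical helper.

```python
from collections import Counter


def document_sentiment(finished, separator="@"):
    """Collapse per-sentence labels such as 'na@positive@na' into one label."""
    votes = Counter(label for label in finished.split(separator) if label != "na")
    if not votes:
        return "na"                      # no sentence received a label
    ranked = votes.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return "mixed"                   # assumed tie-breaking label
    return ranked[0][0]
```

For example, `document_sentiment("na@na@na@negative")` gives `negative`, while a text with one positive and one negative sentence is reported as `mixed`.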


#### 13. The pipeline can also be used directly on an array of dummy text

In [111]:
dummy_data = spark.sparkContext.parallelize([["I am happy and like this spark NLP"], ["Have to say something bad now"]]).toDF().toDF("text")
dummy_data.show()

+--------------------+
|                text|
+--------------------+
|I am happy and li...|
|Have to say somet...|
+--------------------+



In [119]:
pipeline.fit(dummy_data).transform(dummy_data).show()

+--------------------+------------------+
|                text|finished_sentiment|
+--------------------+------------------+
|I am happy and li...|          positive|
|Have to say somet...|          negative|
+--------------------+------------------+



#### 14. The pipeline can be saved to disk for future reuse, either before or after fitting the model

In [113]:
new_pipeline = Pipeline(stages=[
    document_assembler,
    sentence_detector,
    tokenizer,
    normalizer,
    spell_checker,
    sentiment_detector,
    finisher
])

start = time.time()
new_pipeline.write().overwrite().save("./ps")
end = time.time()
print("Time elapsed in write before fitting pipelines: " + str(end - start))
start = time.time()
new_pipeline.fit(data).write().overwrite().save("./ms")
end = time.time()
print("Time elapsed in write after fitting pipelines: " + str(end - start))

Time elapsed in write before fitting pipelines: 0.3147120475769043
Time elapsed in write after fitting pipelines: 6.535132169723511


#### 15. Pipelines can be easily loaded back into memory

In [114]:

start = time.time()
p = Pipeline.read().load("./ps")
pm = PipelineModel.read().load("./ms")
end = time.time()
print("Time elapsed in read pipelines: " + str(end - start))

Time elapsed in read pipelines: 5.312443494796753


In [115]:
# Using the fitted pipeline read from disk
start = time.time()
pm.transform(data).where("finished_sentiment like '%negative%'").show()
print(pm.transform(data).count())
end = time.time()
print("Time elapsed in using loaded pipelines: " + str(end - start))

+------+--------------------+------------------+
|itemid|                text|finished_sentiment|
+------+--------------------+------------------+
|     4|          .. Omga...| na@na@na@negative|
|    24|   &lt;---Sad lev...| na@na@negative@na|
|    30|   I didn't reali...|       negative@na|
|    37|   SOX!     Floyd...|       na@negative|
|    80|  i was too slow ...|          negative|
|    98|  My new car was ...|       negative@na|
|   118|  True, highly su...|    na@negative@na|
|   135| #squarespace bri...|       negative@na|
|   136| #Susan Boyle did...|    na@negative@na|
|   145| &quot;I want you...|       na@negative|
|   195|- @Darcy_Lussier ...|    na@negative@na|
|   219| @khead I'll tell...| na@negative@na@na|
|   238|- @raybooysen I k...|          negative|
|   253|    Not feeling i...|          negative|
|   273|   No man is wort...|          negative|
|   288|        wish ella...|          negative|
|   305|    TODAY WAS A G...|          negative|
|   312|   fail math