<img src="https://nlp.johnsnowlabs.com/assets/images/logo.png" width="180" height="50" style="float: left;">

## Vivekn Sentiment Analysis

In the following example, we walk through Sentiment Analysis training and prediction using Spark NLP annotators.

The ViveknSentimentApproach annotator trains a model based on the [Vivek Narayanan algorithm](https://arxiv.org/pdf/1305.6143.pdf). The training data can be either a column in the training dataset whose rows are labelled 'positive' or 'negative', or two folders, one full of positive text and one full of negative text. Using n-grams and negation of sequences, this statistical model can achieve high accuracy if trained properly.
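To make "n-grams and negation of sequences" concrete, here is a minimal pure-Python sketch of a Naive Bayes sentiment classifier in that spirit. It is a simplified illustration, not Spark NLP's implementation: the helper names (`negate`, `features`, `train`, `predict`), the negation word list, and the toy training sentences are all assumptions made for this example.

```python
import math
import re
from collections import Counter, defaultdict

NEGATIONS = {"not", "no", "never"}        # illustrative list (assumption)
PUNCTUATION = {".", ",", "!", "?", ";"}

def negate(tokens):
    """Prefix tokens that follow a negation word with 'not_' until punctuation."""
    out, negating = [], False
    for tok in tokens:
        if tok in PUNCTUATION:
            negating = False              # punctuation ends the negated span
        elif tok in NEGATIONS:
            negating = True
        else:
            out.append("not_" + tok if negating else tok)
    return out

def features(text):
    """Unigrams plus bigrams over the negation-marked tokens."""
    tokens = re.findall(r"[a-z]+|[.,!?;]", text.lower())
    unigrams = negate(tokens)
    bigrams = [a + " " + b for a, b in zip(unigrams, unigrams[1:])]
    return unigrams + bigrams

def train(samples):
    """Count features per label and documents per label."""
    counts, docs = defaultdict(Counter), Counter()
    for text, label in samples:
        docs[label] += 1
        counts[label].update(features(text))
    return counts, docs

def predict(model, text):
    """Pick the label with the highest Laplace-smoothed log-probability."""
    counts, docs = model
    vocab_size = len({f for c in counts.values() for f in c})
    scores = {}
    for label in docs:
        total = sum(counts[label].values())
        score = math.log(docs[label] / sum(docs.values()))
        for f in features(text):
            score += math.log((counts[label][f] + 1) / (total + vocab_size))
        scores[label] = score
    return max(scores, key=scores.get)

# Toy training data, made up for this sketch
samples = [
    ("I love this movie", "positive"),
    ("what a great day", "positive"),
    ("this is not terrible", "positive"),
    ("I hate this movie", "negative"),
    ("what a terrible day", "negative"),
    ("this is not great", "negative"),
]
model = train(samples)
print(predict(model, "not great, I hate it"))      # negative
print(predict(model, "I love it, not terrible"))   # positive
```

Because "great" after "not" becomes the separate feature `not_great`, the classifier can learn that a negated positive word carries negative sentiment, which a plain bag of words cannot express.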

Spark can be leveraged during training by using the ReadAs.Dataset setting. Spark is used during prediction by default.

We also include a spell checker in this pipeline, which corrects our sentences for better Sentiment Analysis accuracy.

### Spark `2.4` and Spark NLP `2.0.0`

#### 1. Call necessary imports and set the resource path to read local data files

In [1]:
#Imports
import time
import sys
import os

from pyspark.ml import Pipeline, PipelineModel
from pyspark.sql import SparkSession
from pyspark.sql.functions import array_contains,when
from pyspark.sql.functions import col

import sparknlp
from sparknlp.annotator import *
from sparknlp.base import DocumentAssembler, Finisher

#### 2. Load SparkSession if not already there

In [2]:
spark = sparknlp.start()

In [3]:
! wget -N https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/spell/words.txt -P /tmp
! wget -N https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/sentiment.parquet.zip -P /tmp
! unzip /tmp/sentiment.parquet.zip -d /tmp/

--2019-03-21 00:51:14--  https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/spell/words.txt
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.109.21
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.109.21|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4862966 (4.6M) [text/plain]
Saving to: ‘/tmp/words.txt’


2019-03-21 00:51:16 (3.09 MB/s) - ‘/tmp/words.txt’ saved [4862966/4862966]

--2019-03-21 00:51:16--  https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/sentiment.parquet.zip
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.109.21
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.109.21|:443... connected.
HTTP request sent, awaiting response... 304 Not Modified
File ‘/tmp/sentiment.parquet.zip’ not modified on server. Omitting download.

Archive:  /tmp/sentiment.parquet.zip
   creating: /tmp/sentiment.parquet/
  inflating: /tmp/sentiment.parquet/.part-00002-08092d15-dd8c-40f9-a1df-641a1a4b16

#### 3. Load a Spark dataset and put it in memory

In [4]:
#Load the input data to be annotated
#We map a sentiment of 0 to "negative" and any other value to "positive"
data = spark. \
        read. \
        parquet("/tmp/sentiment.parquet"). \
        withColumn("sentiment_label", when(col("sentiment") == 0, "negative").otherwise("positive")). \
        limit(1000).cache()
data.show()

+------+---------+--------------------+---------------+
|itemid|sentiment|                text|sentiment_label|
+------+---------+--------------------+---------------+
|799033|        0|@FrankomQ8 What's...|       negative|
|799034|        1|@FranKoUK guitar ...|       positive|
|799035|        0|@frankparenteau u...|       negative|
|799036|        1|@frankparenteau w...|       positive|
|799037|        1|@FrankPatris dude...|       positive|
|799038|        0|@FrankRamblings a...|       negative|
|799039|        1|@frankroberts  ni...|       positive|
|799040|        0|@frankroberts ur ...|       negative|
|799041|        1|@FrankS Breaking ...|       positive|
|799042|        1|@frankschultelad ...|       positive|
|799043|        0|@frankshorter Wol...|       negative|
|799044|        0|@franksting - its...|       negative|
|799045|        1|@franksting Ha! D...|       positive|
|799046|        1|@franksting yeah,...|       positive|
|799047|        1|@franksting yes, ...|       po

#### 4. Create the document assembler, which will put target text column into Annotation form

In [5]:
### Define the document assembler
document_assembler = DocumentAssembler() \
            .setInputCol("text")\
            .setOutputCol("document")

In [6]:
### Example: Check out the output of the document assembler
assembled = document_assembler.transform(data)
assembled.show(5)

+------+---------+--------------------+---------------+--------------------+
|itemid|sentiment|                text|sentiment_label|            document|
+------+---------+--------------------+---------------+--------------------+
|799033|        0|@FrankomQ8 What's...|       negative|[[document, 0, 36...|
|799034|        1|@FranKoUK guitar ...|       positive|[[document, 0, 46...|
|799035|        0|@frankparenteau u...|       negative|[[document, 0, 11...|
|799036|        1|@frankparenteau w...|       positive|[[document, 0, 77...|
|799037|        1|@FrankPatris dude...|       positive|[[document, 0, 73...|
+------+---------+--------------------+---------------+--------------------+
only showing top 5 rows
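
Each entry in the `document` column prints as `[[document, 0, 36...`, that is, an annotation type followed by begin and end character offsets. The sketch below models such a record in plain Python to show how the offsets relate to the text. The `Annotation` dataclass, its field names, and the `assemble` helper are illustrative assumptions, not Spark NLP classes.

```python
from dataclasses import dataclass, field

@dataclass
class Annotation:
    annotator_type: str
    begin: int                      # index of the first character
    end: int                        # index of the last character (inclusive)
    result: str
    metadata: dict = field(default_factory=dict)

def assemble(text):
    """Wrap a raw string in a single 'document' annotation covering all of it."""
    return [Annotation("document", 0, len(text) - 1, text)]

doc = assemble("@FrankomQ8 What's in it? I'm starving")[0]
print(doc.annotator_type, doc.begin, doc.end)   # document 0 36
```

The end offset is inclusive, which is why a 37-character text ends at 36.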



#### 5. Create Sentence detector to parse sub sentences in every document

In [7]:
### Sentence detector
sentence_detector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

In [8]:
### Example: Check out the output of the sentence detector
sentence_data = sentence_detector.transform(assembled)
sentence_data.show(5)

+------+---------+--------------------+---------------+--------------------+--------------------+
|itemid|sentiment|                text|sentiment_label|            document|            sentence|
+------+---------+--------------------+---------------+--------------------+--------------------+
|799033|        0|@FrankomQ8 What's...|       negative|[[document, 0, 36...|[[document, 0, 23...|
|799034|        1|@FranKoUK guitar ...|       positive|[[document, 0, 46...|[[document, 0, 43...|
|799035|        0|@frankparenteau u...|       negative|[[document, 0, 11...|[[document, 0, 36...|
|799036|        1|@frankparenteau w...|       positive|[[document, 0, 77...|[[document, 0, 71...|
|799037|        1|@FrankPatris dude...|       positive|[[document, 0, 73...|[[document, 0, 35...|
+------+---------+--------------------+---------------+--------------------+--------------------+
only showing top 5 rows
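
The first `sentence` annotation above ends at offset 23 while its document ends at 36, because the document has been cut into sentences. A rough way to picture this is a regular-expression splitter that breaks on `.`, `!` and `?`. This is only a sketch: `split_sentences` is a hypothetical helper, and the real annotator applies far more rules (for example around abbreviations).

```python
import re

# A sentence: a non-space character, then anything up to closing punctuation
# (or the end of the text when there is no closing punctuation).
SENTENCE_RE = re.compile(r"\S[^.!?]*(?:[.!?]+|$)")

def split_sentences(text):
    """Return (begin, end, sentence) triples; `end` is inclusive."""
    sentences = []
    for m in SENTENCE_RE.finditer(text):
        sentence = m.group().rstrip()
        sentences.append((m.start(), m.start() + len(sentence) - 1, sentence))
    return sentences

for item in split_sentences("@FrankomQ8 What's in it? I'm starving"):
    print(item)
# (0, 23, "@FrankomQ8 What's in it?")
# (25, 36, "I'm starving")
```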



#### 6. The tokenizer will match standard tokens

In [9]:
### Tokenizer
tokenizer = Tokenizer() \
            .setInputCols(["sentence"]) \
            .setOutputCol("token")


In [10]:
### Example: Check out the output of the tokenizer
tokenized = tokenizer.transform(sentence_data)
tokenized.show(5)

+------+---------+--------------------+---------------+--------------------+--------------------+--------------------+
|itemid|sentiment|                text|sentiment_label|            document|            sentence|               token|
+------+---------+--------------------+---------------+--------------------+--------------------+--------------------+
|799033|        0|@FrankomQ8 What's...|       negative|[[document, 0, 36...|[[document, 0, 23...|[[token, 0, 0, @,...|
|799034|        1|@FranKoUK guitar ...|       positive|[[document, 0, 46...|[[document, 0, 43...|[[token, 0, 0, @,...|
|799035|        0|@frankparenteau u...|       negative|[[document, 0, 11...|[[document, 0, 36...|[[token, 0, 0, @,...|
|799036|        1|@frankparenteau w...|       positive|[[document, 0, 77...|[[document, 0, 71...|[[token, 0, 0, @,...|
|799037|        1|@FrankPatris dude...|       positive|[[document, 0, 73...|[[document, 0, 35...|[[token, 0, 0, @,...|
+------+---------+--------------------+---------
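
The first token printed above is `[token, 0, 0, @, ...`: the `@` is split from the user name and carries its own offsets. A minimal sketch of that behaviour, assuming a simple regular expression that keeps words (with inner apostrophes) together and treats every other non-space character as its own token. `tokenize` and its pattern are assumptions for illustration; the actual Tokenizer rules are richer and configurable.

```python
import re

# Words (optionally with inner apostrophes) or single punctuation characters
TOKEN_RE = re.compile(r"\w+(?:'\w+)*|[^\w\s]")

def tokenize(sentence, offset=0):
    """Return (begin, end, token) triples; `end` is inclusive."""
    return [
        (offset + m.start(), offset + m.end() - 1, m.group())
        for m in TOKEN_RE.finditer(sentence)
    ]

print(tokenize("@FrankomQ8 What's in it?"))
# [(0, 0, '@'), (1, 9, 'FrankomQ8'), (11, 16, "What's"),
#  (18, 19, 'in'), (21, 22, 'it'), (23, 23, '?')]
```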

#### 7. Normalizer will clean out the tokens

In [11]:
normalizer = Normalizer() \
            .setInputCols(["token"]) \
            .setOutputCol("normal")
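
As a rough picture of what "cleaning out" tokens means, the sketch below strips every character that is not a letter and drops tokens that end up empty. The letters-only pattern and the `normalize` helper are assumptions made for this example; the annotator's actual defaults may differ.

```python
import re

def normalize(tokens, lowercase=False):
    """Keep only ASCII letters in each token and drop tokens left empty."""
    cleaned = []
    for tok in tokens:
        tok = re.sub(r"[^A-Za-z]", "", tok)
        if lowercase:
            tok = tok.lower()
        if tok:
            cleaned.append(tok)
    return cleaned

print(normalize(["@", "FrankomQ8", "What's", "in", "it", "?"]))
# ['FrankomQ', 'Whats', 'in', 'it']
```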

#### 8. The spell checker will correct normalized tokens; it trains with a dictionary of English words

In [12]:
### Spell Checker
spell_checker = NorvigSweetingApproach() \
            .setInputCols(["normal"]) \
            .setOutputCol("spell") \
            .setDictionary("/tmp/words.txt")
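
For intuition about dictionary-based correction, here is a short sketch in the spirit of Peter Norvig's well-known spelling corrector: generate every string one edit away from the input and keep the most frequent one found in the dictionary. It is not the annotator's code; `edits1`, `correct`, and the tiny frequency table are made up for illustration.

```python
import string

def edits1(word):
    """All strings one delete, transpose, replace, or insert away from `word`."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in letters]
    inserts = [l + c + r for l, r in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correct(word, word_counts):
    """Return `word` if known, else the most frequent known word one edit away."""
    if word in word_counts:
        return word
    candidates = [w for w in edits1(word) if w in word_counts]
    if not candidates:
        return word                 # nothing close enough: leave unchanged
    return max(candidates, key=word_counts.get)

# Toy frequency table (hypothetical values)
word_counts = {"guitar": 5, "lessons": 3, "sounds": 4, "good": 10}

print(correct("guitr", word_counts))    # guitar
print(correct("lesons", word_counts))   # lessons
print(correct("xyzzy", word_counts))    # xyzzy
```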

#### 9. Create the ViveknSentimentApproach and set resources to train it

In [13]:
sentiment_detector = ViveknSentimentApproach() \
    .setInputCols(["spell", "sentence"]) \
    .setOutputCol("sentiment") \
    .setSentimentCol("sentiment_label") \
    .setPruneCorpus(0)

#### 10. The finisher will utilize sentiment analysis output

In [14]:
finisher = Finisher() \
    .setInputCols(["sentiment"]) \
    .setIncludeMetadata(False)
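
Conceptually, the finisher turns a column of annotation structures into plain values that are easy to filter and print. The sketch below shows the idea on hand-written dictionaries; the `finish` helper and the dictionary layout are assumptions for illustration, not the Finisher's implementation.

```python
def finish(annotations):
    """Keep only the `result` string of each annotation."""
    return [a["result"] for a in annotations]

# One annotation per sentence, written by hand for this sketch
sentiment = [
    {"annotatorType": "sentiment", "begin": 0, "end": 23, "result": "negative", "metadata": {}},
    {"annotatorType": "sentiment", "begin": 25, "end": 36, "result": "negative", "metadata": {}},
]
print(finish(sentiment))   # ['negative', 'negative']
```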


#### 11. Fit and predict over data

In [15]:
pipeline = Pipeline(stages=[
    document_assembler,
    sentence_detector,
    tokenizer,
    normalizer,
    spell_checker,
    sentiment_detector,
    finisher
])

start = time.time()
sentiment_data = pipeline.fit(data).transform(data)

end = time.time()
print("Time elapsed pipeline process: " + str(end - start))

Time elapsed pipeline process: 6.119040012359619
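
Note that the elapsed time above covers both training (`fit`) and prediction (`transform`) on the same data. The general fit-then-transform flow of a pipeline can be sketched in plain Python: stages that need training are fitted on the data produced by the stages before them, and every stage then transforms the data for the next one. The toy stage classes and the `fit_pipeline` and `transform_pipeline` helpers are hypothetical and only mimic the idea.

```python
class Lowercase:
    """A transformer: it needs no training, only `transform`."""
    def transform(self, rows):
        return [r.lower() for r in rows]

class VocabularyEstimator:
    """An estimator: `fit` learns from data and returns a model."""
    def fit(self, rows):
        return VocabularyModel(sorted({w for r in rows for w in r.split()}))

class VocabularyModel:
    """The fitted model: maps known words to their vocabulary index."""
    def __init__(self, vocab):
        self.index = {w: i for i, w in enumerate(vocab)}
    def transform(self, rows):
        return [[self.index[w] for w in r.split() if w in self.index] for r in rows]

def fit_pipeline(stages, rows):
    """Fit estimators in order, feeding each stage the previous stage's output."""
    fitted = []
    for stage in stages:
        if hasattr(stage, "fit"):
            stage = stage.fit(rows)      # estimator -> trained model
        rows = stage.transform(rows)
        fitted.append(stage)
    return fitted

def transform_pipeline(fitted, rows):
    """Apply every fitted stage in order."""
    for stage in fitted:
        rows = stage.transform(rows)
    return rows

fitted = fit_pipeline([Lowercase(), VocabularyEstimator()], ["Good day", "Bad day"])
print(transform_pipeline(fitted, ["GOOD bad night"]))   # [[2, 0]]
```

Keeping the fitted stages around means new data can be scored later without retraining.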


#### 12. Check the result

In [16]:
sentiment_data.show(5,False)

+------+-----------------------------------------------------------------------------------------------------------------+---------------+--------------------+
|itemid|text                                                                                                             |sentiment_label|finished_sentiment  |
+------+-----------------------------------------------------------------------------------------------------------------+---------------+--------------------+
|799033|@FrankomQ8 What's in it? I'm starving                                                                            |negative       |[negative, negative]|
|799034|@FranKoUK guitar lessons soundsss good  when?  xx                                                                |positive       |[positive, positive]|
|799035|@frankparenteau used and abused, huh? i do feel like that sometimes, especially when clients ignore my invoices. |negative       |[negative, negative]|
|799036|@frankparenteau well, with itune

In [17]:
type(sentiment_data)


pyspark.sql.dataframe.DataFrame

In [18]:
# Negative Sentiments
for r in sentiment_data.where(array_contains(sentiment_data.finished_sentiment, "negative")).take(5):
    print(r['text'].strip(),"->",r['finished_sentiment'])

@FrankomQ8 What's in it? I'm starving -> ['negative', 'negative']
@frankparenteau used and abused, huh? i do feel like that sometimes, especially when clients ignore my invoices. -> ['negative', 'negative']
@FrankRamblings and I CAN listen to them on Itunes, so it's not a corrupted file problem -> ['negative']
@frankroberts ur on top of ur twitter game! thanks hun. like you, I wish I updated more. Dig your writings as well. Sad news RE: Octavia -> ['negative', 'negative', 'negative', 'negative', 'negative']
@frankshorter Wolfram cant do 3900000000000000-3800000000000000 -> ['negative']


In [19]:
# Positive Sentiments
for r in sentiment_data.where(array_contains(sentiment_data.finished_sentiment, "positive")).take(5):
    print(r['text'].strip(),"->",r['finished_sentiment'])

@FranKoUK guitar lessons soundsss good  when?  xx -> ['positive', 'positive']
@frankparenteau well, with itunes he gets more hits w/o promo, probably. eh... -> ['positive', 'positive']
@FrankPatris dude have you season 5? i finished 16 eoisodes of season 4 na -> ['positive', 'positive']
@frankroberts  nice.  this has happened to me like 20 times. lol.  pace yourself. -> ['positive', 'positive', 'positive', 'positive']
@FrankS Breaking Bad: eher Drama mit Comedy, Entourage: Single Camera Comedy, muss man auf jeden Fall gesehen haben -> ['positive']
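
Since `finished_sentiment` holds one label per sentence, a row with mixed labels would match both filters above. One possible way to get a single label per row, sketched here in plain Python, is a majority vote that can then be compared against `sentiment_label`. The `majority_label` and `accuracy` helpers are hypothetical, and the last two rows are made up for illustration.

```python
from collections import Counter

def majority_label(labels):
    """Most frequent sentence label; ties go to the label seen first."""
    return Counter(labels).most_common(1)[0][0]

def accuracy(rows):
    """Share of rows whose majority label equals the gold label."""
    hits = sum(majority_label(r["finished_sentiment"]) == r["sentiment_label"] for r in rows)
    return hits / len(rows)

rows = [
    {"sentiment_label": "negative", "finished_sentiment": ["negative", "negative"]},
    {"sentiment_label": "positive", "finished_sentiment": ["positive", "positive"]},
    {"sentiment_label": "positive", "finished_sentiment": ["positive", "negative", "positive"]},  # made up
    {"sentiment_label": "negative", "finished_sentiment": ["positive"]},                          # made up
]
print(accuracy(rows))   # 0.75
```

Keep in mind that in this notebook the model is evaluated on the same rows it was trained on, so any accuracy measured this way is optimistic; a held-out split gives a fairer estimate.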
