![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/training/english/vivekn-sentiment/VivekNarayanSentimentApproach.ipynb)

## 0. Colab Setup

In [1]:
! wget http://setup.johnsnowlabs.com/colab.sh -O - | bash

--2022-12-23 11:57:17--  http://setup.johnsnowlabs.com/colab.sh
Resolving setup.johnsnowlabs.com (setup.johnsnowlabs.com)... 51.158.130.125
Connecting to setup.johnsnowlabs.com (setup.johnsnowlabs.com)|51.158.130.125|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://setup.johnsnowlabs.com/colab.sh [following]
--2022-12-23 11:57:17--  https://setup.johnsnowlabs.com/colab.sh
Connecting to setup.johnsnowlabs.com (setup.johnsnowlabs.com)|51.158.130.125|:443... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/scripts/colab_setup.sh [following]
--2022-12-23 11:57:17--  https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/scripts/colab_setup.sh
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.108.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:44

## Vivekn Sentiment Analysis

In the following example, we walk through Sentiment Analysis training and prediction using Spark NLP annotators.

The ViveknSentimentApproach annotator trains a sentiment model using the [Vivek Narayanan algorithm](https://arxiv.org/pdf/1305.6143.pdf). Training data can be either a column in the training dataset whose rows are labelled 'positive' or 'negative', or a folder of positive text plus a folder of negative text. Using n-grams and negation of sequences, this statistical model can achieve high accuracy if trained properly.
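To make "n-grams and negation of sequences" more concrete, here is a minimal plain-Python sketch of the idea as the paper describes it: words that follow a negation are marked until the next punctuation mark, and bigrams are then added as extra features. The negation word list, the `not_` prefix, and the function names are assumptions made for illustration; this is not the code Spark NLP runs internally.

```python
import re

# Assumed, deliberately small negation list (illustration only)
NEGATIONS = {"not", "no", "never", "cannot"}
CLAUSE_END = re.compile(r"^[.,!?;:]$")

def mark_negation(tokens):
    """Prefix tokens that follow a negation word with 'not_' until punctuation."""
    out, negated = [], False
    for tok in tokens:
        low = tok.lower()
        if CLAUSE_END.match(low):
            negated = False          # punctuation closes the negated span
            out.append(low)
        elif low in NEGATIONS or low.endswith("n't"):
            negated = not negated    # a second negation cancels the first
            out.append(low)
        else:
            out.append("not_" + low if negated else low)
    return out

def ngrams(tokens, n):
    """Return all contiguous n-grams, joined with a single space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "I did not like this movie , but the music was good".split()
marked = mark_negation(tokens)
features = marked + ngrams(marked, 2)   # unigrams plus bigrams
print(marked)
```

With this marking, "like" inside a negated clause becomes a different feature (`not_like`) from "like" in an affirmative clause, which is what lets a bag-of-words style classifier tell the two apart.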

Spark can be leveraged in training by using the ReadAs.Dataset setting. Spark is used during prediction by default.

We also include a spell checker in this pipeline, which corrects our sentences to improve Sentiment Analysis accuracy.

#### 1. Call necessary imports and set the resource path to read local data files

In [2]:
#Imports
import time
import sys
import os

from pyspark.ml import Pipeline, PipelineModel
from pyspark.sql import SparkSession
from pyspark.sql.functions import array_contains,when
from pyspark.sql.functions import col

import sparknlp
from sparknlp.annotator import *
from sparknlp.base import DocumentAssembler, Finisher

#### 2. Load SparkSession if not already there

In [3]:
spark = sparknlp.start()

print("Spark NLP version: ", sparknlp.version())
print("Apache Spark version: ", spark.version)

Spark NLP version:  4.2.6
Apache Spark version:  3.2.3


In [4]:
! wget -N https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/spell/words.txt -P /tmp
!rm -rf /tmp/sentiment.parquet
! wget -N https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/sentiment.parquet.zip -P /tmp
! unzip /tmp/sentiment.parquet.zip -d /tmp/

--2022-12-23 11:59:02--  https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/spell/words.txt
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.217.166.48, 52.217.203.104, 3.5.20.150, ...
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.217.166.48|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4862966 (4.6M) [text/plain]
Saving to: ‘/tmp/words.txt’


2022-12-23 11:59:03 (30.9 MB/s) - ‘/tmp/words.txt’ saved [4862966/4862966]

--2022-12-23 11:59:03--  https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/sentiment.parquet.zip
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.217.166.48, 52.217.203.104, 3.5.20.150, ...
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.217.166.48|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 76127532 (73M) [application/zip]
Saving to: ‘/tmp/sentiment.parquet.zip’


2022-12-23 11:59:05 (55.2 MB/s) - ‘/tmp/sentiment.parquet.zip’ saved [76127532/76127532]

Archi

#### 3. Load a Spark dataset and put it in memory

In [5]:
# Load the input data to be annotated
# Map sentiment 0 to "negative" and any other value (here 1) to "positive"
data = spark. \
        read. \
        parquet("/tmp/sentiment.parquet"). \
        withColumn("sentiment_label", when(col("sentiment") == 0, "negative").otherwise("positive")). \
        limit(1000).cache()
data.show()

+------+---------+--------------------+---------------+
|itemid|sentiment|                text|sentiment_label|
+------+---------+--------------------+---------------+
|     1|        0|                 ...|       negative|
|     2|        0|                 ...|       negative|
|     3|        1|              omg...|       positive|
|     4|        0|          .. Omga...|       negative|
|     5|        0|         i think ...|       negative|
|     6|        0|         or i jus...|       negative|
|     7|        1|       Juuuuuuuuu...|       positive|
|     8|        0|       Sunny Agai...|       negative|
|     9|        1|      handed in m...|       positive|
|    10|        1|      hmmmm.... i...|       positive|
|    11|        0|      I must thin...|       negative|
|    12|        1|      thanks to a...|       positive|
|    13|        0|      this weeken...|       negative|
|    14|        0|     jb isnt show...|       negative|
|    15|        0|     ok thats it ...|       ne

#### 4. Create the document assembler, which puts the target text column into Annotation form

In [6]:
### Document assembler
document_assembler = DocumentAssembler() \
            .setInputCol("text")\
            .setOutputCol("document")

In [7]:
### Example: Check out the output of the document assembler
assembled = document_assembler.transform(data)
assembled.show(5)

+------+---------+--------------------+---------------+--------------------+
|itemid|sentiment|                text|sentiment_label|            document|
+------+---------+--------------------+---------------+--------------------+
|     1|        0|                 ...|       negative|[{document, 0, 60...|
|     2|        0|                 ...|       negative|[{document, 0, 50...|
|     3|        1|              omg...|       positive|[{document, 0, 36...|
|     4|        0|          .. Omga...|       negative|[{document, 0, 13...|
|     5|        0|         i think ...|       negative|[{document, 0, 52...|
+------+---------+--------------------+---------------+--------------------+
only showing top 5 rows
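The `document` column above holds records of the form `{document, 0, 60...}`. As a rough mental model, each record can be pictured as a small structure with an annotator type, begin and end character offsets, and the text itself. The sketch below is plain Python, not the Spark NLP class: the `Annotation` dataclass and `assemble_document` helper are hypothetical stand-ins, and the sample string is only shaped like the first row (21 leading spaces, 61 characters in total).

```python
from dataclasses import dataclass, field

@dataclass
class Annotation:
    """Hypothetical stand-in for one annotation record."""
    annotator_type: str
    begin: int                 # offset of the first character
    end: int                   # offset of the last character (inclusive)
    result: str
    metadata: dict = field(default_factory=dict)

def assemble_document(text):
    """Wrap a raw string as a single 'document' annotation covering all of it."""
    return [Annotation("document", 0, len(text) - 1, text)]

# A string shaped like the first row of the dataset
text = " " * 21 + "is so sad for my APL friend" + "." * 13
doc = assemble_document(text)[0]
print(doc.annotator_type, doc.begin, doc.end)  # document 0 60
```

The inclusive end offset explains why a 61-character text shows up as `0, 60`, and why later stages can report offsets such as 21 that point back into the original string.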



#### 5. Create the sentence detector to split every document into sentences

In [8]:
### Sentence detector
sentence_detector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

In [9]:
### Example: Check out the output of the sentence detector
sentence_data = sentence_detector.transform(assembled)
sentence_data.show(5)

+------+---------+--------------------+---------------+--------------------+--------------------+
|itemid|sentiment|                text|sentiment_label|            document|            sentence|
+------+---------+--------------------+---------------+--------------------+--------------------+
|     1|        0|                 ...|       negative|[{document, 0, 60...|[{document, 21, 4...|
|     2|        0|                 ...|       negative|[{document, 0, 50...|[{document, 19, 4...|
|     3|        1|              omg...|       positive|[{document, 0, 36...|[{document, 14, 3...|
|     4|        0|          .. Omga...|       negative|[{document, 0, 13...|[{document, 10, 1...|
|     5|        0|         i think ...|       negative|[{document, 0, 52...|[{document, 9, 42...|
+------+---------+--------------------+---------------+--------------------+--------------------+
only showing top 5 rows



#### 6. The tokenizer will match standard tokens

In [10]:
### Tokenizer
tokenizer = Tokenizer() \
            .setInputCols(["sentence"]) \
            .setOutputCol("token")


In [11]:
### Example: Check out the output of the tokenizer
tokenized = tokenizer.fit(sentence_data).transform(sentence_data)
tokenized.show(5)

+------+---------+--------------------+---------------+--------------------+--------------------+--------------------+
|itemid|sentiment|                text|sentiment_label|            document|            sentence|               token|
+------+---------+--------------------+---------------+--------------------+--------------------+--------------------+
|     1|        0|                 ...|       negative|[{document, 0, 60...|[{document, 21, 4...|[{token, 21, 22, ...|
|     2|        0|                 ...|       negative|[{document, 0, 50...|[{document, 19, 4...|[{token, 19, 19, ...|
|     3|        1|              omg...|       positive|[{document, 0, 36...|[{document, 14, 3...|[{token, 14, 16, ...|
|     4|        0|          .. Omga...|       negative|[{document, 0, 13...|[{document, 10, 1...|[{token, 10, 10, ...|
|     5|        0|         i think ...|       negative|[{document, 0, 52...|[{document, 9, 42...|[{token, 9, 9, i,...|
+------+---------+--------------------+---------

#### 7. The normalizer will clean up the tokens

In [12]:
normalizer = Normalizer() \
            .setInputCols(["token"]) \
            .setOutputCol("normal")
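As an intuition for what "cleaning" tokens means, the following plain-Python sketch strips every character that is not a letter and drops tokens that end up empty. The ASCII-only pattern and the `normalize` helper are assumptions chosen for illustration; the annotator's actual default patterns and options may differ, so treat this as an approximation rather than a description of its behaviour.

```python
import re

# Assumed ASCII-only cleanup pattern (illustration only)
NON_LETTERS = re.compile(r"[^A-Za-z]")

def normalize(tokens):
    """Remove non-letter characters and discard tokens that become empty."""
    cleaned = (NON_LETTERS.sub("", tok) for tok in tokens)
    return [tok for tok in cleaned if tok]

print(normalize(["omg", "7:30", ":O", "Chillin!!", "I've"]))
# ['omg', 'O', 'Chillin', 'Ive']
```

Note that a purely numeric token such as `7:30` disappears entirely under this rule, which is usually acceptable for sentiment features but worth keeping in mind.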

#### 8. The spell checker will correct normalized tokens; it is trained with a dictionary of English words

In [13]:
### Spell Checker
spell_checker = NorvigSweetingApproach() \
            .setInputCols(["normal"]) \
            .setOutputCol("spell") \
            .setDictionary("/tmp/words.txt")
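For readers unfamiliar with dictionary-based spell correction, here is a compact plain-Python sketch in the spirit of Peter Norvig's well-known corrector: generate every string one edit away from the input and pick the most frequent candidate found in the dictionary. The toy frequency table and the `edits1` and `correct` helpers are assumptions for illustration; the Spark NLP annotator is more elaborate and is not reproduced here.

```python
from collections import Counter
import string

def edits1(word):
    """All strings one delete, transpose, replace, or insert away from word."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in letters]
    inserts = [l + c + r for l, r in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correct(word, freq):
    """Return word if known, else the most frequent known word one edit away."""
    if word in freq:
        return word
    candidates = [w for w in edits1(word) if w in freq]
    return max(candidates, key=lambda w: freq[w]) if candidates else word

# Toy dictionary with made-up counts (illustration only)
freq = Counter({"supposed": 5, "crown": 3, "dentist": 2})
print(correct("suposed", freq))  # supposed
```

Unknown words with no close match are returned unchanged, so slang or names that are missing from the dictionary pass through to the sentiment stage as they are.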

#### 9. Create the ViveknSentimentApproach and set resources to train it

In [16]:
sentiment_detector = ViveknSentimentApproach() \
    .setInputCols(["spell", "sentence"])\
    .setOutputCol("sentiment")\
    .setSentimentCol("sentiment_label")\
    .setPruneCorpus(0)

#### 10. The finisher will convert the sentiment annotations into plain string arrays

In [17]:
finisher = Finisher() \
    .setInputCols(["sentiment"]) \
    .setIncludeMetadata(False)


#### 11. Fit and predict over data

In [18]:
pipeline = Pipeline(stages=[
    document_assembler,
    sentence_detector,
    tokenizer,
    normalizer,
    spell_checker,
    sentiment_detector,
    finisher
])

start = time.time()
sentiment_data = pipeline.fit(data).transform(data)

end = time.time()
print("Time elapsed pipeline process: " + str(end - start))

Time elapsed pipeline process: 22.8889741897583
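One caveat when reading this timing: Spark DataFrame transformations are lazy, so the figure above most likely reflects the `fit` step (training) plus building the transformation plan, while the prediction work is deferred until an action such as `show` runs. A small reusable stopwatch makes it easier to time the steps separately. The `stopwatch` helper below is a hypothetical utility, and the workload inside it is a stand-in so the sketch runs without Spark.

```python
import time
from contextlib import contextmanager

@contextmanager
def stopwatch(label, sink=print):
    """Report the wall-clock time spent inside the with-block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        sink(f"{label}: {time.perf_counter() - start:.2f}s")

# Stand-in workload so the sketch runs without a Spark session
with stopwatch("toy workload"):
    total = sum(i * i for i in range(100_000))
```

In the notebook, one could wrap `model = pipeline.fit(data)` in one `stopwatch` block and `model.transform(data).count()` in another; the `count()` call is assumed here only as a convenient action to force evaluation.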


#### 12. Check the result

In [19]:
sentiment_data.show(5,False)

+------+------------------------------------------------------------------------------------------------------------------------------------+---------------+----------------------------------------+
|itemid|text                                                                                                                                |sentiment_label|finished_sentiment                      |
+------+------------------------------------------------------------------------------------------------------------------------------------+---------------+----------------------------------------+
|1     |                     is so sad for my APL friend.............                                                                       |negative       |[negative]                              |
|2     |                   I missed the New Moon trailer...                                                                                 |negative       |[negative]                              |
|3   

In [20]:
type(sentiment_data)


pyspark.sql.dataframe.DataFrame

In [21]:
# Negative Sentiments
for r in sentiment_data.where(array_contains(sentiment_data.finished_sentiment, "negative")).take(5):
    print(r['text'].strip(),"->",r['finished_sentiment'])

is so sad for my APL friend............. -> ['negative']
I missed the New Moon trailer... -> ['negative']
.. Omgaga. Im sooo  im gunna CRy. I've been at this dentist since 11.. I was suposed 2 just get a crown put on (30mins)... -> ['negative', 'negative', 'negative', 'negative']
i think mi bf is cheating on me!!!       T_T -> ['negative', 'na']
or i just worry too much? -> ['negative']


In [22]:
# Positive Sentiments
for r in sentiment_data.where(array_contains(sentiment_data.finished_sentiment, "positive")).take(5):
    print(r['text'].strip(),"->",r['finished_sentiment'])

omg its already 7:30 :O -> ['positive']
Juuuuuuuuuuuuuuuuussssst Chillin!! -> ['positive']
handed in my uniform today . i miss you already -> ['positive', 'negative']
hmmmm.... i wonder how she my number @-) -> ['na', 'positive']
thanks to all the haters up in my face all day! 112-102 -> ['positive']
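The outputs above appear to carry one label per detected sentence, so a row such as `['positive', 'negative']` matches both `array_contains` filters. If a single label per document is needed, one option is a simple majority vote that ignores `na`. The sketch below is plain Python over lists shaped like `finished_sentiment`; the `document_sentiment` helper and the `mixed` tie label are assumptions for illustration, not part of Spark NLP.

```python
from collections import Counter

def document_sentiment(labels, ignore=("na",)):
    """Collapse sentence-level labels into one label by majority vote."""
    counts = Counter(label for label in labels if label not in ignore)
    if not counts:
        return "na"                      # nothing usable to vote on
    (top, top_n), *rest = counts.most_common(2)
    if rest and rest[0][1] == top_n:
        return "mixed"                   # hypothetical label for ties
    return top

print(document_sentiment(["negative", "na"]))        # negative
print(document_sentiment(["positive", "negative"]))  # mixed
```

Whether ties should be reported separately, or resolved in favour of one class, depends on the application; the vote here is only one reasonable choice.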
