![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/training/english/dictionary-sentiment/sentiment.ipynb)

## 0. Colab Setup

In [1]:
# This is only to set up PySpark and Spark NLP on Colab
!wget http://setup.johnsnowlabs.com/colab.sh -O - | bash

--2022-12-23 11:26:26--  http://setup.johnsnowlabs.com/colab.sh
Resolving setup.johnsnowlabs.com (setup.johnsnowlabs.com)... 51.158.130.125
Connecting to setup.johnsnowlabs.com (setup.johnsnowlabs.com)|51.158.130.125|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://setup.johnsnowlabs.com/colab.sh [following]
--2022-12-23 11:26:26--  https://setup.johnsnowlabs.com/colab.sh
Connecting to setup.johnsnowlabs.com (setup.johnsnowlabs.com)|51.158.130.125|:443... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/scripts/colab_setup.sh [following]
--2022-12-23 11:26:26--  https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/scripts/colab_setup.sh
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:44

## Rule-based Sentiment Analysis

In the following example, we walk through a simple use case for the straightforward `SentimentDetector` annotator.

This annotator works on top of a dictionary of labeled terms, each of which can carry any of the following labels:
    
    positive
    negative
    revert
    increment
    decrement

Each labeled term is then used to compute a sentiment score for the text.
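
To make the five labels above concrete, here is a minimal pure-Python sketch of how a dictionary-driven scorer might combine them. It is not Spark NLP's implementation: the `score_tokens` and `to_label` functions, the toy dictionary, and the multiplier values are all assumptions chosen for illustration.

```python
# Illustrative only: a toy rule-based scorer, not Spark NLP's SentimentDetector.
# The multiplier values below are assumptions chosen for this sketch.
BASE_SCORES = {"positive": 1.0, "negative": -1.0}
MODIFIERS = {"revert": -1.0, "increment": 2.0, "decrement": 0.5}

# Hypothetical dictionary, in the spirit of a "term,label" resource file.
TOY_DICT = {
    "good": "positive",
    "bad": "negative",
    "not": "revert",
    "very": "increment",
    "slightly": "decrement",
}


def score_tokens(tokens, dictionary=TOY_DICT):
    """Sum per-token scores; modifier terms scale the next scored term."""
    total = 0.0
    pending = 1.0  # product of modifiers waiting for a scored term
    for token in tokens:
        label = dictionary.get(token.lower())
        if label in MODIFIERS:
            pending *= MODIFIERS[label]
        elif label in BASE_SCORES:
            total += pending * BASE_SCORES[label]
            pending = 1.0
        # Tokens outside the dictionary contribute nothing.
    return total


def to_label(score):
    """Map a score to a label; treating 0 as positive is an assumption."""
    return "positive" if score >= 0 else "negative"
```

With these assumed weights, `score_tokens(["not", "very", "good"])` multiplies the pending modifiers (-1.0 and 2.0) before applying them to `good`, giving -2.0. Under this sketch, a text with no dictionary hits scores 0 and is labeled positive.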

#### 1. Call necessary imports and set the resource path to read local data files

In [2]:
#Imports
import sys
sys.path.append('../../')

import sparknlp

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.sql.functions import array_contains
from sparknlp.annotator import *
from sparknlp.common import RegexRule
from sparknlp.base import DocumentAssembler, Finisher

#### 2. Start the SparkSession if it is not already running

In [3]:
import sparknlp 

spark = sparknlp.start()

print("Spark NLP version: ", sparknlp.version())
print("Apache Spark version: ", spark.version)

Spark NLP version:  4.2.6
Apache Spark version:  3.2.3


In [4]:
! rm /tmp/sentiment.parquet.zip
! rm -rf /tmp/sentiment.parquet
! wget -N https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/sentiment.parquet.zip -P /tmp
! wget -N https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/lemma-corpus-small/lemmas_small.txt -P /tmp
! wget -N https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/sentiment-corpus/default-sentiment-dict.txt -P /tmp    

rm: cannot remove '/tmp/sentiment.parquet.zip': No such file or directory
--2022-12-23 11:28:03--  https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/sentiment.parquet.zip
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.136.198, 52.216.112.141, 52.217.137.208, ...
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.136.198|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 76127532 (73M) [application/zip]
Saving to: ‘/tmp/sentiment.parquet.zip’


2022-12-23 11:28:05 (57.7 MB/s) - ‘/tmp/sentiment.parquet.zip’ saved [76127532/76127532]

--2022-12-23 11:28:05--  https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/lemma-corpus-small/lemmas_small.txt
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.136.198, 52.216.112.141, 52.217.137.208, ...
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.136.198|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 189437 (185K) [text/plain]
Saving

In [5]:
! unzip /tmp/sentiment.parquet.zip -d /tmp/

Archive:  /tmp/sentiment.parquet.zip
   creating: /tmp/sentiment.parquet/
  inflating: /tmp/sentiment.parquet/.part-00002-08092d15-dd8c-40f9-a1df-641a1a4b1698.snappy.parquet.crc  
  inflating: /tmp/sentiment.parquet/part-00002-08092d15-dd8c-40f9-a1df-641a1a4b1698.snappy.parquet  
  inflating: /tmp/sentiment.parquet/part-00003-08092d15-dd8c-40f9-a1df-641a1a4b1698.snappy.parquet  
  inflating: /tmp/sentiment.parquet/.part-00000-08092d15-dd8c-40f9-a1df-641a1a4b1698.snappy.parquet.crc  
  inflating: /tmp/sentiment.parquet/part-00001-08092d15-dd8c-40f9-a1df-641a1a4b1698.snappy.parquet  
 extracting: /tmp/sentiment.parquet/_SUCCESS  
  inflating: /tmp/sentiment.parquet/.part-00003-08092d15-dd8c-40f9-a1df-641a1a4b1698.snappy.parquet.crc  
  inflating: /tmp/sentiment.parquet/part-00000-08092d15-dd8c-40f9-a1df-641a1a4b1698.snappy.parquet  
  inflating: /tmp/sentiment.parquet/.part-00001-08092d15-dd8c-40f9-a1df-641a1a4b1698.snappy.parquet.crc  


In [6]:
data = spark. \
        read. \
        parquet("/tmp/sentiment.parquet"). \
        limit(10000).cache()

data.show()

+------+---------+--------------------+
|itemid|sentiment|                text|
+------+---------+--------------------+
|     1|        0|                 ...|
|     2|        0|                 ...|
|     3|        1|              omg...|
|     4|        0|          .. Omga...|
|     5|        0|         i think ...|
|     6|        0|         or i jus...|
|     7|        1|       Juuuuuuuuu...|
|     8|        0|       Sunny Agai...|
|     9|        1|      handed in m...|
|    10|        1|      hmmmm.... i...|
|    11|        0|      I must thin...|
|    12|        1|      thanks to a...|
|    13|        0|      this weeken...|
|    14|        0|     jb isnt show...|
|    15|        0|     ok thats it ...|
|    16|        0|    &lt;-------- ...|
|    17|        0|    awhhe man.......|
|    18|        1|    Feeling stran...|
|    19|        0|    HUGE roll of ...|
|    20|        0|    I just cut my...|
+------+---------+--------------------+
only showing top 20 rows



#### 3. Create the appropriate annotators. We detect sentences, tokenize them, and find the lemmas of those tokens. The Finisher outputs only the sentiment.

In [7]:
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence_detector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

lemmatizer = Lemmatizer() \
    .setInputCols(["token"]) \
    .setOutputCol("lemma") \
    .setDictionary("/tmp/lemmas_small.txt", key_delimiter="->", value_delimiter="\t")
        
sentiment_detector = SentimentDetector() \
    .setInputCols(["lemma", "sentence"]) \
    .setOutputCol("sentiment_score") \
    .setDictionary("/tmp/default-sentiment-dict.txt", ",")
    
finisher = Finisher() \
    .setInputCols(["sentiment_score"]) \
    .setOutputCols(["sentiment"])
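
The two `setDictionary` calls above pass delimiters that describe the layout of each resource file. The plain-Python sketch below shows how lines with those delimiters could be parsed. The helper names and sample lines are hypothetical, and the assumed layouts (`lemma -> form<TAB>form` and `term,label`) are inferred from the delimiters, so inspect the downloaded files to confirm them.

```python
def parse_lemma_line(line, key_delimiter="->", value_delimiter="\t"):
    """Split one lemma line into (lemma, [inflected forms])."""
    key, _, values = line.partition(key_delimiter)
    forms = [v.strip() for v in values.split(value_delimiter) if v.strip()]
    return key.strip(), forms


def parse_sentiment_line(line, delimiter=","):
    """Split one sentiment line into (term, label)."""
    term, _, label = line.partition(delimiter)
    return term.strip(), label.strip()


def build_form_to_lemma(lines):
    """Invert lemma lines into a lookup from inflected form to lemma."""
    lookup = {}
    for line in lines:
        lemma, forms = parse_lemma_line(line)
        for form in forms:
            lookup[form] = lemma
    return lookup


# Hypothetical sample lines, not copied from the downloaded files.
lemma_lines = ["walk -> walks\twalked\twalking"]
sentiment_lines = ["good,positive", "not,revert"]

form_to_lemma = build_form_to_lemma(lemma_lines)
sentiment_dict = dict(parse_sentiment_line(line) for line in sentiment_lines)
```

Here `form_to_lemma` maps each inflected form back to its lemma, which is the kind of lookup a lemmatizer needs, while `sentiment_dict` maps each term to its label.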

#### 4. Fit the pipeline. The annotators are trained only from the external resources, not from the dataset we pass in. Prediction then runs on the target dataset.

In [8]:
pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, lemmatizer, sentiment_detector, finisher])
model = pipeline.fit(data)
result = model.transform(data)

#### 5. Filter the Finisher output to find the lines with positive sentiment

In [9]:
result.where(array_contains(result.sentiment, "positive")).show(10,False)

+------+----------+------------------------------------------------------------------------------------------------------------------------------------+
|itemid|sentiment |text                                                                                                                                |
+------+----------+------------------------------------------------------------------------------------------------------------------------------------+
|1     |[positive]|                     is so sad for my APL friend.............                                                                       |
|2     |[positive]|                   I missed the New Moon trailer...                                                                                 |
|3     |[positive]|              omg its already 7:30 :O                                                                                               |
|4     |[positive]|          .. Omgaga. Im sooo  im gunna CRy. I've been at this d