
## Sentiment Analysis Pipeline

This pipeline illustrates several important features of the Spark NLP library: Sentence Detection, Tokenization, Lemmatization, and Sentiment Detection.
The idea is to start with natural language as a user might have entered it and obtain the sentiment associated with it. Let's walk through each of the stages!


### Spark `2.4` and Spark NLP `1.8.3`

#### 1. Call necessary imports and set the resource path to read local data files

In [None]:
#Imports
import sys
sys.path.append('../../')

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.sql.functions import array_contains
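from sparknlp.annotator import *
# DocumentAssembler and Finisher live in sparknlp.base, not sparknlp.annotator
from sparknlp.base import DocumentAssembler, Finisher

# Base directory of the local resource files (dictionaries) used below.
# File names are appended with `+`, so the trailing slash is required.
resource_path = "../demo_pipelines/movies_sentiment/"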

#### 2. Load SparkSession if not already there

In [None]:
spark = SparkSession.builder \
    .appName("SentimentDetector")\
    .master("local[*]")\
    .config("spark.driver.memory","8G")\
    .config("spark.driver.maxResultSize", "2G")\
    .config("spark.jars.packages", "JohnSnowLabs:spark-nlp:1.8.3")\
    .config("spark.kryoserializer.buffer.max", "500m")\
    .getOrCreate()

#### 3. Define the annotators that make up the pipeline

In [None]:
document_assembler = DocumentAssembler() \
    .setInputCol("text")

sentence_detector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

lemmatizer = Lemmatizer() \
    .setInputCols(["token"]) \
    .setOutputCol("lemma") \
    .setDictionary(resource_path+"lemma-corpus-small/lemmas_small.txt", key_delimiter="->", value_delimiter="\t")
        
sentiment_detector = SentimentDetector() \
    .setInputCols(["lemma", "sentence"]) \
    .setOutputCol("sentiment_score") \
    .setDictionary(resource_path+"sentiment-corpus/default-sentiment-dict.txt", ",")
    
finisher = Finisher() \
    .setInputCols(["sentiment_score"]) \
    .setOutputCols(["sentiment"])
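
The two `setDictionary` calls above read plain-text files whose layout is described by the delimiters passed in. The following is a minimal pure-Python sketch of how such files could be parsed and combined into a label, assuming lemma lines of the form `lemma -> form1<TAB>form2` and sentiment lines of the form `word,polarity`. It is a toy illustration, not Spark NLP's implementation: the function names, the sample entries, and the `neutral` fallback are all assumptions made for this example.

```python
# Toy illustration only; this is NOT how Spark NLP implements its annotators.

def parse_lemma_dict(lines, key_delimiter="->", value_delimiter="\t"):
    """Map every inflected form to its lemma."""
    form_to_lemma = {}
    for line in lines:
        if key_delimiter not in line:
            continue  # skip blank or malformed lines
        lemma, forms = line.split(key_delimiter, 1)
        for form in forms.strip().split(value_delimiter):
            if form.strip():
                form_to_lemma[form.strip()] = lemma.strip()
    return form_to_lemma


def parse_sentiment_dict(lines, delimiter=","):
    """Map each dictionary word to its polarity string."""
    word_to_polarity = {}
    for line in lines:
        if delimiter not in line:
            continue  # skip blank or malformed lines
        word, polarity = line.strip().split(delimiter, 1)
        word_to_polarity[word.strip()] = polarity.strip()
    return word_to_polarity


def score_tokens(tokens, form_to_lemma, word_to_polarity):
    """Lemmatize each token, then count positive and negative hits."""
    score = 0
    for token in tokens:
        lemma = form_to_lemma.get(token.lower(), token.lower())
        polarity = word_to_polarity.get(lemma)
        if polarity == "positive":
            score += 1
        elif polarity == "negative":
            score -= 1
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"  # assumption of this sketch for a tied or empty score


# Hypothetical sample entries, written in the assumed file formats.
lemma_lines = ["love -> loved\tloves\tloving", "hate -> hated\thates\thating"]
sentiment_lines = ["love,positive", "hate,negative"]

form_to_lemma = parse_lemma_dict(lemma_lines)
word_to_polarity = parse_sentiment_dict(sentiment_lines)
label = score_tokens(["I", "loved", "this", "movie"], form_to_lemma, word_to_polarity)
print(label)
```

The sketch also shows why the sentiment detector takes the `lemma` column as input: the dictionary only needs one entry per lemma, and inflected forms such as "loved" are normalized before lookup.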

#### 4. Fit the pipeline and run it on the target dataset

The annotators in this pipeline are trained only from the external resources (the dictionaries), not from the dataset passed to `fit`. Prediction then runs on the target dataset.

In [None]:
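# `data` must be a DataFrame with a string column named "text", the DocumentAssembler input.
# The two rows below are placeholder input for illustration; substitute your own dataset.
data = spark.createDataFrame([("I loved this movie.",), ("The plot was terrible.",)], ["text"])
pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, lemmatizer, sentiment_detector, finisher])
model = pipeline.fit(data)
result = model.transform(data)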
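
To make the "trained only from external resources" point concrete, here is a small pure-Python sketch of the fit/transform split. The classes `ToySentimentEstimator` and `ToySentimentModel` are hypothetical stand-ins invented for this example, not Spark ML or Spark NLP classes. The estimator builds its model state from dictionary lines alone, so fitting on different datasets yields the same model.

```python
# Hypothetical toy classes that mimic the estimator/model split in plain Python.

class ToySentimentModel:
    def __init__(self, polarity_by_word):
        self.polarity_by_word = polarity_by_word

    def transform(self, rows):
        """Attach the polarities of all dictionary words found in each text."""
        labeled = []
        for text in rows:
            hits = [
                self.polarity_by_word[word]
                for word in text.lower().split()
                if word in self.polarity_by_word
            ]
            labeled.append((text, hits))
        return labeled


class ToySentimentEstimator:
    def __init__(self, dictionary_lines):
        self.dictionary_lines = dictionary_lines

    def fit(self, rows):
        # `rows` is deliberately unused: the model state comes only from
        # the external dictionary, never from the dataset being fitted.
        polarity_by_word = dict(
            line.split(",", 1) for line in self.dictionary_lines
        )
        return ToySentimentModel(polarity_by_word)


estimator = ToySentimentEstimator(["good,positive", "bad,negative"])
model_a = estimator.fit(["any text at all"])
model_b = estimator.fit([])  # even an empty dataset gives the same model
same_state = model_a.polarity_by_word == model_b.polarity_by_word
predictions = model_a.transform(["a good movie", "a bad plot"])
print(same_state, predictions)
```

In the real pipeline, `fit` is still required because `Pipeline` is an estimator, but with dictionary-driven annotators the dataset passed to it mainly supplies the schema.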

#### 5. Filter the finisher output to find the lines with positive sentiment

In [None]:
result.where(array_contains(result.sentiment, "positive")).show(10,False)
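
The filter works because the finisher's `sentiment` column holds an array of label strings per row, and `array_contains` keeps a row when that array includes the given value. Below is a plain-Python sketch of the same selection logic. The sample rows and the helper name `array_contains_py` are made up for illustration, and the assumption that the array carries one label per detected sentence is not confirmed by the text above.

```python
# Plain-Python analogue of `where(array_contains(col, value))`.

def array_contains_py(values, target):
    """Return True when `target` is present in `values`.

    Spark's array_contains yields null for a null array, and `where` drops
    rows whose condition is null, so None is treated as "no match" here.
    """
    return values is not None and target in values


# Hypothetical rows shaped like the finisher output.
rows = [
    {"text": "Great cast. Weak ending.", "sentiment": ["positive", "negative"]},
    {"text": "Dull and slow.", "sentiment": ["negative"]},
    {"text": "Loved it.", "sentiment": ["positive"]},
    {"text": "", "sentiment": None},
]

positive_rows = [row for row in rows if array_contains_py(row["sentiment"], "positive")]
positive_texts = [row["text"] for row in positive_rows]
print(positive_texts)
```

Note that a row with mixed labels is kept: the condition asks whether any element is "positive", not whether all of them are.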