<a href="https://colab.research.google.com/github/Maksym-Tymchenko/johnsnow/blob/main/SENTIMENT_DETECTION_USING_SNOW_LABS_PIPELINES.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Sentiment Analysis for News Articles**

## 1. Colab Setup

In [2]:
# Install PySpark and Spark NLP
! pip install -q pyspark==3.1.2 spark-nlp

# Install Spark NLP Display lib
! pip install --upgrade -q spark-nlp-display

In [17]:
import sparknlp
import pandas as pd
from pyspark.ml import Pipeline
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from tabulate import tabulate
from sparknlp.annotator import *
from sparknlp.base import *
from sparknlp.pretrained import PretrainedPipeline

spark = sparknlp.start()

print("Spark NLP version: ", sparknlp.version())
print("Apache Spark version: ", spark.version)

Spark NLP version:  3.4.0
Apache Spark version:  3.1.2


## 2. Sample sentences

In [62]:
text_list = [
"""In April 2005 , Neste separated from its parent company , Finnish energy company Fortum , and became listed on the Helsinki Stock Exchange .""",
"""Finnish IT solutions provider Affecto Oyj HEL : AFE1V said today its slipped to a net loss of EUR 115,000 USD 152,000 in the second quarter of 2010 from a profit of EUR 845,000 in the corresponding period a year earlier .""",             
"""10 February 2011 - Finnish media company Sanoma Oyj HEL : SAA1V said yesterday its 2010 net profit almost tripled to EUR297 .3 m from EUR107 .1 m for 2009 and announced a proposal for a raised payout .""",
"""Profit before taxes decreased by 9 % to EUR 187.8 mn in the first nine months of 2008 , compared to EUR 207.1 mn a year earlier .""",
"""The world 's second largest stainless steel maker said net profit in the three-month period until Dec. 31 surged to euro603 million US$ 781 million , or euro3 .33 US$ 4.31 per share , from euro172 million , or euro0 .94 per share , the previous year .""",
"""TietoEnator signed an agreement to acquire Indian research and development ( R&D ) services provider and turnkey software solutions developer Fortuna Technologies Pvt. Ltd. for 21 mln euro ( $ 30.3 mln ) in September 2007 .""",
]

## 3. Define Spark NLP pipeline

In [8]:
# Wrap the raw "text" column into Spark NLP document annotations
document = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Compute BERT sentence embeddings for each document
embeddings = BertSentenceEmbeddings \
    .pretrained("sent_bert_wiki_books_sst2", "en") \
    .setInputCols(["document"]) \
    .setOutputCol("sentence_embeddings")

# Classify each document with the pretrained financial sentiment model
sentimentClassifier = ClassifierDLModel \
    .pretrained("classifierdl_bertwiki_finance_sentiment", "en") \
    .setInputCols(["document", "sentence_embeddings"]) \
    .setOutputCol("class")

financial_sentiment_pipeline = Pipeline(stages=[document, embeddings, sentimentClassifier])

# All stages are pretrained, so fitting on an empty DataFrame only assembles the PipelineModel
light_pipeline = LightPipeline(financial_sentiment_pipeline.fit(spark.createDataFrame([['']]).toDF("text")))

sent_bert_wiki_books_sst2 download started this may take some time.
Approximate size to download 389.7 MB
[OK!]
classifierdl_bertwiki_finance_sentiment download started this may take some time.
Approximate size to download 22.5 MB
[OK!]


## 4. Run the pipeline

In [75]:
empty_df = spark.createDataFrame([['']]).toDF("text")
pipelineModel = financial_sentiment_pipeline.fit(empty_df)

# Create dataframe from text_list defined above
df = spark.createDataFrame(pd.DataFrame({"text":text_list}))
result = pipelineModel.transform(df)
# show() prints the table itself and returns None, so it is not wrapped in print()
result.show()

+--------------------+--------------------+--------------------+--------------------+
|                text|            document| sentence_embeddings|               class|
+--------------------+--------------------+--------------------+--------------------+
|In April 2005 , N...|[{document, 0, 13...|[{sentence_embedd...|[{category, 0, 13...|
|Finnish IT soluti...|[{document, 0, 22...|[{sentence_embedd...|[{category, 0, 22...|
|10 February 2011 ...|[{document, 0, 20...|[{sentence_embedd...|[{category, 0, 20...|
|Profit before tax...|[{document, 0, 12...|[{sentence_embedd...|[{category, 0, 12...|
|The world 's seco...|[{document, 0, 25...|[{sentence_embedd...|[{category, 0, 25...|
|TietoEnator signe...|[{document, 0, 22...|[{sentence_embedd...|[{category, 0, 22...|
+--------------------+--------------------+--------------------+--------------------+


## 5. Visualize results of custom pipeline

In [96]:
result.select(F.explode(F.arrays_zip('document.result', 'class.result')).alias("cols")) \
.select(F.expr("cols['0']").alias("document"),
        F.expr("cols['1']").alias("class")).show(truncate=False)

+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------+
|document                                                                                                                                                                                                                                                   |class   |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------+
|In April 2005 , Neste separated from its parent company , Finnish energy company Fortum , and became listed on the Helsinki Stock Exchange .                                                                      

## 6. Annotate text using the analyze_sentiment pipeline

Load the pretrained pipeline `analyze_sentiment` and predict the sentiment of a toy sentence.

In [98]:
pipeline = PretrainedPipeline('analyze_sentiment', lang='en')
annotations = pipeline.fullAnnotate("Hello from John Snow Labs ! ")[0]
annotations["sentiment"]

analyze_sentiment download started this may take some time.
Approx size to download 4.9 MB
[OK!]


[Annotation(sentiment, 0, 26, negative, {'confidence': '0.3729'})]
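The annotation above carries a confidence of only 0.3729, so it may be worth treating such predictions as undecided. The following is a minimal sketch of that idea, not part of Spark NLP: it works on a plain dictionary that mirrors the printed fields, and both the helper name `label_with_threshold` and the 0.5 cut-off are assumptions chosen for illustration.

```python
from typing import Any, Dict


def label_with_threshold(annotation: Dict[str, Any], min_confidence: float = 0.5) -> str:
    """Return the predicted label, or 'uncertain' if confidence is below the cut-off.

    `annotation` is a plain-dict stand-in (hypothetical) with the same fields
    as the printed Annotation: a `result` label and a `metadata` dict whose
    `confidence` value is stored as a string.
    """
    confidence = float(annotation["metadata"]["confidence"])
    if confidence >= min_confidence:
        return annotation["result"]
    return "uncertain"


# Values copied from the cell output above
example = {"result": "negative", "metadata": {"confidence": "0.3729"}}
print(label_with_threshold(example))  # uncertain
```

The threshold is a tuning choice; a lower value keeps more predictions at the cost of trusting weaker ones.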

## 7. Annotate Spark DataFrame using the analyze_sentimentdl_glove_imdb pipeline

### Annotate a simple sentence

Load the pretrained pipeline `analyze_sentimentdl_glove_imdb` and predict the sentiment of a simple sentence.

In [97]:
pipeline = PretrainedPipeline('analyze_sentimentdl_glove_imdb', lang = 'en')

# Annotate simple sentence
annotations =  pipeline.annotate("Hello from John Snow Labs ! ")
print("Sentence annotation: ")
print(annotations["sentiment"])

analyze_sentimentdl_glove_imdb download started this may take some time.
Approx size to download 155.3 MB
[OK!]
Sentence annotation: 
['neg']
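The two pretrained pipelines used in this notebook report labels in different vocabularies (`negative` from `analyze_sentiment`, `neg` and `pos` from `analyze_sentimentdl_glove_imdb`). If their outputs need to be compared, one option is to map them onto a single set of names. The sketch below is plain Python; the `LABEL_ALIASES` table and `normalize_label` helper are hypothetical names introduced here, not part of Spark NLP.

```python
# Hypothetical alias table covering the short labels seen in the outputs above
LABEL_ALIASES = {"neg": "negative", "pos": "positive"}


def normalize_label(label: str) -> str:
    """Map short labels to long ones; leave unknown labels unchanged."""
    cleaned = label.strip().lower()
    return LABEL_ALIASES.get(cleaned, cleaned)


normalized = [normalize_label(label) for label in ["neg", "negative", "pos"]]
print(normalized)  # ['negative', 'negative', 'positive']
```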


### Annotate a small DataFrame

Use the same pretrained pipeline, `analyze_sentimentdl_glove_imdb`, to predict the sentiment of a collection of sentences stored in a small DataFrame.

In [95]:
# Create toy dataframe 
sentences = [
  ['Hello, this is an example sentence'],
  ['And this is a second sentence.']
]

# `spark` is the Spark session created by sparknlp.start() in the setup section.
data = spark.createDataFrame(sentences).toDF("text")

# Transform 'data' and store output in a new 'annotations_df' dataframe
annotations_df = pipeline.transform(data)

# Extract sentiment only
essential_df = annotations_df.select('text', 'sentiment.result')

# Show the results
essential_df.show()

+--------------------+------+
|                text|result|
+--------------------+------+
|Hello, this is an...| [neg]|
|And this is a sec...| [pos]|
+--------------------+------+
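Each row's `result` column holds a list of labels, so summarising a larger DataFrame means flattening those lists before counting. The sketch below shows one way to do that in plain Python. It assumes the rows have already been brought to the driver as `(text, labels)` pairs; the `rows` list is a hand-written stand-in for that data, and `count_labels` is a hypothetical helper.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def count_labels(rows: Iterable[Tuple[str, List[str]]]) -> Counter:
    """Count how often each sentiment label occurs across all rows."""
    return Counter(label for _text, labels in rows for label in labels)


# Stand-in for the two rows shown in the table above
rows = [
    ("Hello, this is an example sentence", ["neg"]),
    ("And this is a second sentence.", ["pos"]),
]
counts = count_labels(rows)
print(dict(counts))  # {'neg': 1, 'pos': 1}
```

For DataFrames too large to collect, the same tally would normally be done in Spark itself rather than on the driver.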

