<a href="https://colab.research.google.com/github/Brand-Sentiment-Tracking/python-package/blob/main/CLASS_FOR_SENTIMENT_DETECTION_USING_SNOW_LABS_PIPELINES.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Class for Sentiment Analysis for News Articles**

## 1. Colab Setup

In [55]:
# Install PySpark and Spark NLP
! pip install -q pyspark==3.1.2 spark-nlp

# Install Spark NLP Display lib
! pip install --upgrade -q spark-nlp-display

In [61]:
import sparknlp
import pandas as pd
from pyspark.ml import Pipeline
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from tabulate import tabulate
from sparknlp.annotator import *
from sparknlp.base import *
from sparknlp.pretrained import PretrainedPipeline
from sparknlp_display import NerVisualizer

spark = sparknlp.start()

print("Spark NLP version: ", sparknlp.version())
print("Apache Spark version: ", spark.version)

Spark NLP version:  3.4.0
Apache Spark version:  3.1.2


## Define a News Article

In [57]:
article = [ # two strings - headline & article body
"""Google sued in US over 'deceptive' location tracking""", # headline
"""Google is being sued in the US over accusations it deceived people about how to control location tracking.

The legal action refers to a widely reported 2018 revelation turning off one location-tracking setting in its apps was insufficient to fully disable the feature.

It accuses Google of using so-called dark patterns, marketing techniques that deliberately confuse.

Google said the claims were inaccurate and outdated.

'Unfair practices'
The legal action was filed in the District of Columbia. Similar ones were also filed in Texas, Indiana and Washington state.

It refers to an Associated Press revelation turning off Location History when using Google Maps or Search was insufficient - as a separate setting, Web and App Activity, continued to log location and other personal data.

The study, with researchers at Princeton University, found up to two billion Android and Apple devices could be affected.

"Google has relied on, and continues to rely on, deceptive and unfair practices that make it difficult for users to decline location tracking or to evaluate the data collection and processing to which they are purportedly consenting," the legal action alleges.

'Robust controls'
Google told BBC News the case was based "on inaccurate claims and outdated assertions about our settings".

A representative added: "We have always built privacy features into our products and provided robust controls for location data.

"We will vigorously defend ourselves and set the record straight."

Visual misdirection
The legal action claims Google's policies contained other "misleading, ambiguous and incomplete descriptions... but guarantee that consumers will not understand when their location is collected and retained by Google or for what purposes".

It refers to dark patterns, design choices that alter users' decision-making for the designer's benefit - such as, complicated navigation menus, visual misdirection, confusing wording and repeated nudging towards a particular outcome.

Data regulators are increasingly focusing on these practices.

Google faces a raft of other legal actions in the US, including:

In May 2020, Arizona filed a legal action over the same issue
In December 2020, multiple US states sued over the price and process of advertising auctions
In October 2020, the US Justice Department alleged Google had a monopoly over search and search advertising"""]



## Define the Brand Identification Class

In [58]:
class BrandIdentification:
    def __init__(self, MODEL_NAME, text_list):
        self.MODEL_NAME = MODEL_NAME
        self.headline = text_list[0]
        self.body = text_list[1]

        # Define Spark NLP pipeline 
        documentAssembler = DocumentAssembler() \
            .setInputCol('text') \
            .setOutputCol('document')

        tokenizer = Tokenizer() \
            .setInputCols(['document']) \
            .setOutputCol('token')

        # The ner_dl and onto_100 models are trained with glove_100d, so the
        # embeddings in the pipeline must match
        if (self.MODEL_NAME == "ner_dl") or (self.MODEL_NAME == "onto_100"):
            embeddings = WordEmbeddingsModel.pretrained('glove_100d') \
                .setInputCols(["document", 'token']) \
                .setOutputCol("embeddings")

        # The BERT model uses BERT embeddings
        elif self.MODEL_NAME == "ner_dl_bert":
            embeddings = BertEmbeddings.pretrained(name='bert_base_cased', lang='en') \
                .setInputCols(['document', 'token']) \
                .setOutputCol('embeddings')

        # Fail early: without this, an unknown name leaves `embeddings` undefined
        else:
            raise ValueError(f"Unsupported MODEL_NAME: {self.MODEL_NAME}")

        ner_model = NerDLModel.pretrained(MODEL_NAME, 'en') \
            .setInputCols(['document', 'token', 'embeddings']) \
            .setOutputCol('ner')

        ner_converter = NerConverter() \
            .setInputCols(['document', 'token', 'ner']) \
            .setOutputCol('ner_chunk')

        nlp_pipeline = Pipeline(stages=[
            documentAssembler, 
            tokenizer,
            embeddings,
            ner_model,
            ner_converter
        ])
        
        # Get the pipeline model
        empty_df = spark.createDataFrame([['']]).toDF('text')
        self.pipeline_model = nlp_pipeline.fit(empty_df)


    def predict_by_headline(self):
        # Run the pipeline for the headline
        text_df_hl = spark.createDataFrame(pd.DataFrame({'text': self.headline}, index = [0]))
        self.result_hl = self.pipeline_model.transform(text_df_hl)
        
        # Tabulate the results
        df = self.result_hl.select(F.explode(F.arrays_zip('document.result', 'ner_chunk.result',"ner_chunk.metadata")).alias("cols")).select(\
        F.expr("cols['1']").alias("chunk"),
        F.expr("cols['2'].entity").alias('result'))
        
        # Rank the identified ORGs by frequencies
        self.ranked_df_hl = df.filter(df.result == 'ORG').groupBy(df.chunk).count().orderBy('count', ascending=False)
        
        # If exactly one distinct ORG appears in the headline, return it
        if self.ranked_df_hl.count() == 1:
            return self.ranked_df_hl.first()[0]
        else: # No ORG, or more than one distinct ORG: no headline prediction
            return None


    def predict(self):
        result_by_headline = self.predict_by_headline()

        # Use the prediction from the headline if we get one
        # (reuse the stored result rather than running the pipeline again)
        if result_by_headline is not None:
            return result_by_headline
        else:
            # Run the pipeline for the article body
            text_df_ar = spark.createDataFrame(pd.DataFrame({'text': self.body}, index = [0]))
            self.result = self.pipeline_model.transform(text_df_ar)
            
            # Tabulate the results
            df = self.result.select(F.explode(F.arrays_zip('document.result', 'ner_chunk.result',"ner_chunk.metadata")).alias("cols")).select(\
            F.expr("cols['1']").alias("chunk"),
            F.expr("cols['2'].entity").alias('result'))
            
            # Rank the identified ORGs by frequencies
            self.ranked_df = df.filter(df.result == 'ORG').groupBy(df.chunk).count().orderBy('count', ascending=False)
            
            # Return the most frequent ORG, but only if it appears more than twice
            top = self.ranked_df.first()  # None when no ORG was found in the body
            if top is not None and top[1] > 2:
                return top[0]
            else:
                return None

            # TODO: break ties - consider Wikidata


    def visualise(self, ranked_df, result):
        # Visualise the table of freq
        ranked_df.show(100, truncate=False)

        # Visualise ORG names in text
        NerVisualizer().display(
            result = result.collect()[0],
            label_col = 'ner_chunk',
            document_col = 'document',
            labels=['ORG']
        )
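
The constructor above pairs each NER model with the embeddings it was trained on. A minimal sketch of the same pairing as a lookup table follows. The helper name `embeddings_for` is hypothetical and not part of the notebook or Spark NLP; the pairings themselves are copied from the class. With a table, an unsupported model name fails with a readable error.

```python
# Pairings copied from BrandIdentification.__init__; the helper is hypothetical.
EMBEDDINGS_FOR_MODEL = {
    "ner_dl": "glove_100d",
    "onto_100": "glove_100d",
    "ner_dl_bert": "bert_base_cased",
}


def embeddings_for(model_name: str) -> str:
    """Return the embeddings name paired with an NER model, or raise ValueError."""
    try:
        return EMBEDDINGS_FOR_MODEL[model_name]
    except KeyError:
        supported = ", ".join(sorted(EMBEDDINGS_FOR_MODEL))
        raise ValueError(
            f"Unsupported model {model_name!r}; expected one of: {supported}"
        ) from None


print(embeddings_for("onto_100"))  # glove_100d
```

Adding support for another model would then mean adding one dictionary entry, provided its embeddings are loaded by a matching annotator.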


## Define the Sentiment Identification Class

In [59]:
class SentimentIdentification:

    def __init__(self, MODEL_NAME):
        """Creates a sentiment identifier that uses the specified model.

        Args:
          MODEL_NAME: Name of the Spark NLP pretrained pipeline.
        """

        # Create the pipeline instance
        self.MODEL_NAME = MODEL_NAME
        self.pipeline_model = PretrainedPipeline(self.MODEL_NAME, lang = 'en')


    def predict(self, text):
        """Predicts and prints the sentiment of the input string.

        Args:
          text: String to classify.
        """
        self.text = text

        # Annotate simple sentence
        annotations =  self.pipeline_model.annotate(self.text)
        print(f"{annotations['sentiment']} {annotations['document']}")
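
`predict` prints the raw lists found under the `sentiment` and `document` keys of the dict returned by `annotate`. A small sketch of pulling a single label out of a dict of that shape follows. `first_label` is a hypothetical helper, and the example dict is hand-written to mimic the shape; it is not real pipeline output.

```python
from typing import Dict, List, Optional


def first_label(annotations: Dict[str, List[str]],
                key: str = "sentiment") -> Optional[str]:
    """Return the first label stored under `key`, or None if missing or empty."""
    labels = annotations.get(key) or []
    return labels[0] if labels else None


# Hand-written stand-in for the dict that `annotate` returns; not real output.
example = {
    "document": ["An example sentence."],
    "sentiment": ["neg"],
}
print(first_label(example))  # neg
```

Returning the label instead of printing it would let callers store or compare predictions, at the cost of changing the notebook's current output.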

## 4. Identify Brand in news article


In [62]:
MODEL_NAME = "onto_100"

brand = BrandIdentification(MODEL_NAME, article)

print(f"Brand identified by headline: {brand.predict_by_headline()}")
brand.visualise(brand.ranked_df_hl, brand.result_hl)

print(f"Brand identified by the whole article: {brand.predict()}")
brand.visualise(brand.ranked_df, brand.result)

glove_100d download started this may take some time.
Approximate size to download 145.3 MB
[OK!]
onto_100 download started this may take some time.
Approximate size to download 13.5 MB
[OK!]
Brand identified by headline: None
+-----+-----+
|chunk|count|
+-----+-----+
+-----+-----+



Brand identified by the whole article: Google
+--------------------+-----+
|chunk               |count|
+--------------------+-----+
|Google              |3    |
|BBC News            |1    |
|Associated Press    |1    |
|Princeton University|1    |
|Location History    |1    |
|Justice Department  |1    |
|Apple               |1    |
+--------------------+-----+
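
The two tables above can be traced through the selection rule by hand. The sketch below restates the rule in plain Python, without Spark, using the counts printed above. `pick_brand` is a hypothetical helper, not part of the class.

```python
from typing import Dict, Optional

# Counts copied from the two tables above (the headline table is empty).
headline_counts: Dict[str, int] = {}
body_counts: Dict[str, int] = {
    "Google": 3,
    "BBC News": 1,
    "Associated Press": 1,
    "Princeton University": 1,
    "Location History": 1,
    "Justice Department": 1,
    "Apple": 1,
}


def pick_brand(headline: Dict[str, int], body: Dict[str, int],
               threshold: int = 2) -> Optional[str]:
    """Apply the headline rule first, then fall back to the body rule."""
    # Headline rule: exactly one distinct ORG wins outright.
    if len(headline) == 1:
        return next(iter(headline))
    # Body rule: the most frequent ORG must appear more than `threshold` times.
    if not body:
        return None
    top = max(body, key=body.get)
    return top if body[top] > threshold else None


print(pick_brand(headline_counts, body_counts))  # Google
```

The headline yields no ORG, so the body decides: Google appears 3 times, which exceeds the threshold of 2, while every other ORG appears once.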



## Classify Article using the analyze_sentimentdl_glove_imdb pipeline

In [63]:
identifier = SentimentIdentification(MODEL_NAME = "analyze_sentimentdl_glove_imdb")

# Predict by headline
headline = article[0]
identifier.predict(headline)

# Predict by body
body = article[1]
identifier.predict(body)

analyze_sentimentdl_glove_imdb download started this may take some time.
Approx size to download 155.3 MB
[OK!]
['neg'] ["Google sued in US over 'deceptive' location tracking"]
['neg'] ['Google is being sued in the US over accusations it deceived people about how to control location tracking.\n\nThe legal action refers to a widely reported 2018 revelation turning off one location-tracking setting in its apps was insufficient to fully disable the feature.\n\nIt accuses Google of using so-called dark patterns, marketing techniques that deliberately confuse.\n\nGoogle said the claims were inaccurate and outdated.\n\n\'Unfair practices\'\nThe legal action was filed in the District of Columbia. Similar ones were also filed in Texas, Indiana and Washington state.\n\nIt refers to an Associated Press revelation turning off Location History when using Google Maps or Search was insufficient - as a separate setting, Web and App Activity, continued to log location and other personal data.\n\nThe s