<a href="https://colab.research.google.com/github/Brand-Sentiment-Tracking/python-package/blob/main/johnsnow/Charlize%20-%20Sentiment%20Detection%20John%20Snows.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Class for Sentiment Analysis for News Articles**

## Colab Setup

In [2]:
# Install PySpark and Spark NLP
! pip install -q pyspark==3.1.2 spark-nlp

# Install Spark NLP Display lib
! pip install --upgrade -q spark-nlp-display

[K     |████████████████████████████████| 212.4 MB 72 kB/s 
[K     |████████████████████████████████| 140 kB 67.2 MB/s 
[K     |████████████████████████████████| 198 kB 76.1 MB/s 
[?25h  Building wheel for pyspark (setup.py) ... [?25l[?25hdone
[K     |████████████████████████████████| 95 kB 3.6 MB/s 
[K     |████████████████████████████████| 66 kB 5.8 MB/s 
[?25h

In [3]:
import sparknlp
import pandas as pd
from pyspark.ml import Pipeline
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from tabulate import tabulate
import sparknlp
from sparknlp.annotator import *
from sparknlp.base import *
from sparknlp.pretrained import PretrainedPipeline
from sparknlp_display import NerVisualizer

spark = sparknlp.start(gpu=False)
# spark = sparknlp.start(gpu=True)

print("Spark NLP version: ", sparknlp.version())
print("Apache Spark version: ", spark.version)

Spark NLP version:  3.4.1
Apache Spark version:  3.1.2


## Define a News Article

In [4]:
article = [ # two strings - headline & article body
"""Google sued in US over 'deceptive' location tracking""", # headline
"""Google is being sued in the US over accusations it deceived people about how to control location tracking.

The legal action refers to a widely reported 2018 revelation turning off one location-tracking setting in its apps was insufficient to fully disable the feature.

It accuses Google of using so-called dark patterns, marketing techniques that deliberately confuse.

Google said the claims were inaccurate and outdated.

'Unfair practices'
The legal action was filed in the District of Columbia. Similar ones were also filed in Texas, Indiana and Washington state.

It refers to an Associated Press revelation turning off Location History when using Google Maps or Search was insufficient - as a separate setting, Web and App Activity, continued to log location and other personal data.

The study, with researchers at Princeton University, found up to two billion Android and Apple devices could be affected.

"Google has relied on, and continues to rely on, deceptive and unfair practices that make it difficult for users to decline location tracking or to evaluate the data collection and processing to which they are purportedly consenting," the legal action alleges.

'Robust controls'
Google told BBC News the case was based "on inaccurate claims and outdated assertions about our settings".

A representative added: "We have always built privacy features into our products and provided robust controls for location data.

"We will vigorously defend ourselves and set the record straight."

Visual misdirection
The legal action claims Google's policies contained other "misleading, ambiguous and incomplete descriptions... but guarantee that consumers will not understand when their location is collected and retained by Google or for what purposes".

It refers to dark patterns, design choices that alter users' decision-making for the designer's benefit - such as, complicated navigation menus, visual misdirection, confusing wording and repeated nudging towards a particular outcome.

Data regulators are increasingly focusing on these practices.

Google faces a raft of other legal actions in the US, including:

In May 2020, Arizona filed a legal action over the same issue
In December 2020, multiple US states sued over the price and process of advertising auctions
In October 2020, the US Justice Department alleged Google had a monopoly over search and search advertising"""]



## Define the Brand Identification Class

In [10]:
class BrandIdentification:
    def __init__(self, MODEL_NAME):
        self.MODEL_NAME = MODEL_NAME

        # Define Spark NLP pipeline 
        documentAssembler = DocumentAssembler() \
            .setInputCol('text') \
            .setOutputCol('document')

        tokenizer = Tokenizer() \
            .setInputCols(['document']) \
            .setOutputCol('token')

        # ner_dl and onto_100 model are trained with glove_100d, so the embeddings in the pipeline should match
        if (self.MODEL_NAME == "ner_dl") or (self.MODEL_NAME == "onto_100"):
            embeddings = WordEmbeddingsModel.pretrained('glove_100d') \
                .setInputCols(["document", 'token']) \
                .setOutputCol("embeddings")

        # Bert model uses Bert embeddings
        elif self.MODEL_NAME == "ner_dl_bert":
            embeddings = BertEmbeddings.pretrained(name='bert_base_cased', lang='en') \
                .setInputCols(['document', 'token']) \
                .setOutputCol('embeddings')

        ner_model = NerDLModel.pretrained(MODEL_NAME, 'en') \
            .setInputCols(['document', 'token', 'embeddings']) \
            .setOutputCol('ner')

        ner_converter = NerConverter() \
            .setInputCols(['document', 'token', 'ner']) \
            .setOutputCol('ner_chunk')

        nlp_pipeline = Pipeline(stages=[
            documentAssembler, 
            tokenizer,
            embeddings,
            ner_model,
            ner_converter
        ])
        
        # Create the pipeline model
        empty_df = spark.createDataFrame([['']]).toDF('text') # An empty df with column name "text"
        self.pipeline_model = nlp_pipeline.fit(empty_df)


    def create_ranked_table(self, text): # text could be a pandas dataframe with a column "text" or a list of strings or a single string
        # Run the pipeline for the text
        if isinstance(text, pd.DataFrame): text_df = spark.createDataFrame(text) # If input a pandas dataframe
        elif isinstance(text, str): text_df = spark.createDataFrame(pd.DataFrame({'text': text}, index=[0])) # If input a single string
        else: text_df = spark.createDataFrame(pd.DataFrame({'text': text})) # If input a list of strings
        result = self.pipeline_model.transform(text_df)
        
        # Create a table with only entity names and types
        df = result.select(F.explode(F.arrays_zip('document.result', 'ner_chunk.result',"ner_chunk.metadata")).alias("cols")).select(\
        F.expr("cols['1']").alias("chunk"),
        F.expr("cols['2'].entity").alias('result'))
        
        # Filter only ORGs
        df = df.filter(df.result == 'ORG')

        # Rank the ORGs by frequencies
        ranked_df = df.groupBy(df.chunk).count().orderBy('count', ascending=False)

        return df, ranked_df


    def predict_by_headline(self, headline): # headline can be a pd df, a list of string or a single string
        _, ranked_df_hl = self.create_ranked_table(headline)
        # ranked_df_hl.show(100, truncate=False)
        
        # If no ORG identified in headline, return None
        if ranked_df_hl.count() == 0:
            return None
        # If only one ORG appears in headline, return it
        elif ranked_df_hl.count() == 1:
            return ranked_df_hl.first()[0]
        # If one ORG appear more than the others, return the first one 
        elif ranked_df_hl.first()[1] > ranked_df_hl.collect()[1][1]:
            return ranked_df_hl.first()[0] 
        else: # If multiple ORGs appear the same time, leave decision to article body (TO BE MODIFIED)
            return None
            # return ranked_df_hl.first()[0] 


    def predict(self, headline, body): # body can be a pd df, a list of string or a single string
        _, ranked_df = self.create_ranked_table(body)

        # Return the ORG with highest freq (at least >= 2)
        if ranked_df.first()[1] >= 2: 
            return ranked_df.first()[0] 
        else:
            return None
        # TO DO: break even - Wikidata#

## Define the Sentiment Identification Class

In [6]:
class SentimentIdentification:

    def __init__(self, MODEL_NAME):
        """Creates a class for sentiment identication using specified model.

        Args:
          MODEL_NAME: Name of the Spark NLP pretrained pipeline.
        """

        # Create the pipeline instance
        self.MODEL_NAME = MODEL_NAME

        if self.MODEL_NAME == "custom_pipeline": # https://nlp.johnsnowlabs.com/2021/11/03/bert_sequence_classifier_finbert_en.html
          document_assembler = DocumentAssembler() \
              .setInputCol('text') \
              .setOutputCol('document')

          tokenizer = Tokenizer() \
              .setInputCols(['document']) \
              .setOutputCol('token')

          sequenceClassifier = BertForSequenceClassification \
                .pretrained('bert_sequence_classifier_finbert', 'en') \
                .setInputCols(['token', 'document']) \
                .setOutputCol('class') \
                .setCaseSensitive(True) \
                .setMaxSentenceLength(512)

          pipeline = Pipeline(stages=[
              document_assembler,
              tokenizer,
              sequenceClassifier
          ])

          self.pipeline_model = LightPipeline(pipeline.fit(spark.createDataFrame([['']]).toDF("text")))

        else:
          self.pipeline_model = PretrainedPipeline(self.MODEL_NAME, lang = 'en')



    def predict(self, text):
        """Predicts sentiment of the input string..

        Args:
          text: String to classify.
        """
        self.text = text

        # Annotate input text using pretrained model
        annotations =  self.pipeline_model.annotate(self.text)

        # Depending on the chosen pipeline the outputs will be slightly different
        if self.MODEL_NAME == "analyze_sentimentdl_glove_imdb":
          # print(f"{annotations['sentiment']} {annotations['document']}")

          if isinstance(self.text, list):
            return [annotation['sentiment'][0] for annotation in annotations] # Return the sentiment list of strings
          else:
            return annotations['sentiment'][0] # Return the sentiment string

        else:
          # print(f"{annotations['class']} {annotations['document']}")

          if isinstance(self.text, list):
            return [annotation['class'][0] for annotation in annotations] # Return the sentiment list of strings
          else:
            return annotations['class'][0] # Return the sentiment string

## Identify Brand in news article


In [11]:
MODEL_NAME = "ner_dl_bert" # MODEL_NAME = "onto_100"

brand_identifier = BrandIdentification(MODEL_NAME)
headline, body = article

brand_by_headline = brand_identifier.predict_by_headline(headline)
print(brand_by_headline)

# Only use article body if no brand identified in the headline
if brand_by_headline == None:
    brand = brand_identifier.predict(body)
    print(brand)

bert_base_cased download started this may take some time.
Approximate size to download 389.1 MB
[OK!]
ner_dl_bert download started this may take some time.
Approximate size to download 15.4 MB
[OK!]
Google


## Classify article using chosen pipeline

In [12]:
# identifier = SentimentIdentification(MODEL_NAME =  "analyze_sentimentdl_glove_imdb")
# identifier = SentimentIdentification(MODEL_NAME =  "classifierdl_bertwiki_finance_sentiment_pipeline")
identifier = SentimentIdentification(MODEL_NAME = "custom_pipeline") # Uses https://nlp.johnsnowlabs.com/2021/11/03/bert_sequence_classifier_finbert_en.html

# Predict by headline
headline = article[0]
identifier.predict(headline)

# Predict by body
body = article[1]
identifier.predict(body)


bert_sequence_classifier_finbert download started this may take some time.
Approximate size to download 390.9 MB
[OK!]


'negative'

## Test the accuracy of sentiment using Kaggle data (Financial News Headlines)

## NER - Kaggle Data

### Convert Kaggle data to Pandas dataframe and preprocess

In [13]:
# Load the data from Github
NER_url = 'https://raw.githubusercontent.com/Brand-Sentiment-Tracking/python-package/main/data/NER_test_data.csv'

# Convert data to Pandas dataframe 
df_NER = pd.read_csv(NER_url).head(500)
df_NER.columns = ['Brand', 'text']

# Shuffle the DataFrame rows
df_NER = df_NER.sample(frac = 1)

# Make dataset smaller for faster runtime
num_sentences = 100
total_num_sentences = df_NER.shape[0]
df_NER.drop(df_NER.index[num_sentences:total_num_sentences], inplace=True)

print(df_NER.shape)

(100, 2)


### Identify the brand in each sentence & compute accuracy



In [14]:
import time

MODEL_NAME = "ner_dl_bert" # MODEL_NAME = "onto_100" / "ner_dl"
brand_identifier = BrandIdentification(MODEL_NAME)

bert_base_cased download started this may take some time.
Approximate size to download 389.1 MB
[OK!]
ner_dl_bert download started this may take some time.
Approximate size to download 15.4 MB
[OK!]


In [24]:
## Measure how long to create a ranked table for one headline (string) only
# Randomly select one headline
hl_str = df_NER.iloc[8, 1]
# print(hl_str)

start = time.time()

df, ranked_df = brand_identifier.create_ranked_table(hl_str)
# df.show() 
# ranked_df.show() # Showing both tables takes 5 seconds

mid = time.time()

brand = brand_identifier.predict_by_headline(hl_str)

end = time.time()

print(f"{mid-start} seconds elapsed to create a ranked table for 1 sentence.")
print(f"{end-mid} seconds elapsed to predict a brand for 1 sentence.")

0.4179878234863281 seconds elapsed to create a ranked table for 1 sentence.
3.2677555084228516 seconds elapsed to predict a brand for 1 sentence.


In [22]:
## Measure how long to create a ranked table for a pandas dataframe of 100 headlines
start = time.time()

df, ranked_df = brand_identifier.create_ranked_table(df_NER) # The pandas df is changed to a spark df
# df.show(300, truncate=False) 
# ranked_df.show(100, truncate=False) # Showing both tables takes 40 seconds

end = time.time()

print(f"{end-start} seconds elapsed to create ranked tables for {num_sentences} sentences.")

0.25003647804260254 seconds elapsed to create ranked tables for 100 sentences.


In [36]:
## Measure how long to create a ranked table for the list of 100 headlines
# Create a list of headline strings
hl_list = df_NER['text'].tolist()

start = time.time()

df, ranked_df = brand_identifier.create_ranked_table(hl_list) # The list is first changed into a pandas df, then a spark df
# df.show(300, truncate=False) 
# ranked_df.show(100, truncate=False) # Showing both tables takes 40 seconds

end = time.time()
print(f"{end-start} seconds elapsed to create a ranked table for {num_sentences} sentences.")

0.27132105827331543 seconds elapsed to create a ranked table for 100 sentences.


In [27]:
## Measure how long it takes to create a ranked table, then identify a brand from the table for each headline using for loop
# Create ranked tables for each headline using for loop takes 40 seconds
# What takes long is the process to access each ranked table and identify a brand based on manual rules
start = time.time()

# Use list comprehension to identify a brand for each headline (row)
df_NER['Predicted Brand'] = [brand_identifier.predict_by_headline(hl) for hl in df_NER['text']]

end = time.time()
print(f"{(end-start)} seconds elapsed to classify {num_sentences} sentences.")

# Compute the accuracy
y_true = df_NER['Brand'].to_numpy()
y_pred = df_NER['Predicted Brand'].to_numpy()
print(f"The accuracy is {100* sum(y_true==y_pred)/len(y_true)}%. \n")

pd.set_option("display.max_rows", None)
print(df_NER) # Variation issue in names is rare

609.7073035240173 seconds elapsed to classify 100 sentences.
The accuracy is 22.0%. 

                                Brand  ...                   Predicted Brand
164                              None  ...                              None
12                           Talentum  ...                  Finnish Talentum
34                         Sanoma Oyj  ...                              None
179                              None  ...                               EUR
206                              None  ...                              None
195                       Outotec Oyj  ...                       Outotec Oyj
451                              None  ...                              None
60                              Atria  ...                             Atria
80                               None  ...                              None
408                           Tradeka  ...                           Tradeka
411                              None  ...                         

### Identify all brands using Spark Dataframe of sentences as input (not working)

In [35]:
# Convert the pandas df to spark df
df_spark = spark.createDataFrame(df_NER) 

df_spark = brand_identifier.pipeline_model.transform(df_spark)

# Measure how long it takes
# start = time.time()

# Each row of this spark df contains all identified entities for one sentence
df_spark.select("ner_chunk").show(20, truncate=False) 
print(type(df_spark.select("ner_chunk").first())) # pyspark.sql Row type
# df_spark.printSchema()

# Create ranked table for each row of this spark df - how to do it without a for loop?
# rdd = df_spark.rdd.map(lambda x: 
#     (x['ner_chunk.result'],x["ner_chunk.metadata"]))  # this does not work
# df_spark_brand = rdd.toDF(["chuck","result"])
# df_spark_brand.show()

# end = time.time()

# print(f"{end-start} seconds elapsed to create ranked tables for {num_sentences} sentences.")

+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|ner_chunk                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          

## Sentiment - Kaggle Data

### Convert Kaggle data to Pandas dataframe and preprocess

In [None]:
sentiment_url = 'https://raw.githubusercontent.com/Brand-Sentiment-Tracking/python-package/main/data/sentiment_test_data.csv'

# Store data in a Pandas Dataframe
df_pandas = pd.read_csv(sentiment_url)

# Change column names (pipelines require a "text" column to predict)
df_pandas.columns = ['True_Sentiment', 'text']

# Shuffle the DataFrame rows
df_pandas = df_pandas.sample(frac = 1)

# Make dataset smaller for faster runtime
num_sentences = 100
total_num_sentences = df_pandas.shape[0]
df_pandas.drop(df_pandas.index[num_sentences:total_num_sentences], inplace=True)

print(df_pandas.shape)


(100, 2)


### Identify the sentiment in each sentence

In [None]:
# Create the identifier object
# identifier = SentimentIdentification(MODEL_NAME = "custom_pipeline") # 90.2% accuracy on 500 sentences 89.8% on 1000 sentences
identifier = SentimentIdentification(MODEL_NAME =  "classifierdl_bertwiki_finance_sentiment_pipeline") # Alternative pretrained pipeline 90.0% accuracy on 500 sentences

preds = []
target = []
ignored_idxs = []
sentiment_to_ignore = "" # e.g. neutral

# Measure how long it takes
start = time.time()

# Collect predicted sentiment for each headline - take three minutes to run
for idx, hl in enumerate(df_pandas['text']):

    # Only append the sentiment if it is not the sentiment to ignore (e.g. neutral)
    target_sentiment = df_pandas["True_Sentiment"][df_pandas.index[idx]]

    if target_sentiment != sentiment_to_ignore:
      preds.append(identifier.predict(hl))
    else:
      ignored_idxs.append(idx)

    # Print progress
    if idx % 25 == 0:
      print(f"Classification {100*idx/num_sentences}% done.")

# Remove all ignored entries from dataset
df_pandas.drop(df_pandas.index[ignored_idxs], inplace=True)

df_pandas['Predicted_Sentiment'] = preds

# Measure how long it takes
end = time.time()
print(f"{end-start} seconds elapsed to classify {num_sentences} sentences.")

# Modify predicted labels to match with true labels
# df = df.replace({'Predicted Sentiment': {'pos' : 'positive', 'neg' : 'negative'}})

df_pandas

classifierdl_bertwiki_finance_sentiment_pipeline download started this may take some time.
Approx size to download 412.9 MB
[OK!]
Classification 0.0% done.
Classification 25.0% done.
Classification 50.0% done.
Classification 75.0% done.
33.950716972351074 seconds elapsed to classify 100 sentences.


Unnamed: 0,True_Sentiment,text,Predicted_Sentiment
3667,neutral,"ADP News - Jan 13 , 2009 - Finnish industrial ...",neutral
2009,positive,Elcoteq 's stock of orders has stabilised in t...,positive
1177,neutral,"As a result of the merger , the largest profes...",neutral
2771,neutral,AffectoGenimap builds highly customised IT sol...,neutral
3903,neutral,The solution is demonstrated on a tablet devel...,neutral
...,...,...,...
595,positive,The cooperation will double The Switch 's conv...,positive
3439,neutral,Union and company officials did not return cal...,neutral
310,positive,"Via the Satlan acquisition , Teleste plans to ...",positive
875,positive,"Profitability ( EBIT % ) was 13.9 % , compared...",positive


### Measure the Accuracy

In [None]:
from sklearn.metrics import classification_report

y_true = df_pandas['True_Sentiment'].to_numpy()
y_pred = df_pandas['Predicted_Sentiment'].to_numpy()

print(f"The accuracy is {100* sum(y_true==y_pred)/len(y_true)}%. \n")

target_names = ['positive', 'neutral', 'negative']

# Compute classification metrics - poor accuracy
print(classification_report(y_true, y_pred, target_names=target_names))

The accuracy is 90.0%. 

              precision    recall  f1-score   support

    positive       1.00      0.93      0.97        15
     neutral       0.90      0.92      0.91        50
    negative       0.86      0.86      0.86        35

    accuracy                           0.90       100
   macro avg       0.92      0.90      0.91       100
weighted avg       0.90      0.90      0.90       100



### Classify using Spark Dataframe as input

In [None]:
from pyspark.sql.functions import array_join
from pyspark.sql.functions import col, explode, expr, greatest
from pyspark.sql.window import Window
from pyspark.sql.functions import monotonically_increasing_id, row_number

# Define pretrained pipeline
pipeline = PretrainedPipeline("classifierdl_bertwiki_finance_sentiment_pipeline", lang = 'en')
# identifier = SentimentIdentification(MODEL_NAME = "custom_pipeline")

# Convert to spark dataframe for faster prediction
df_spark = spark.createDataFrame(df_pandas) 

# Measure how long it takes
start = time.time()

# Predict the sentiment
df_spark = pipeline.transform(df_spark)

# # df_spark = identifier.pipeline_model.transform(df_spark)


# print(df_spark.first()['class'])
# df_spark.printSchema()

#Extract sentiment score
df_spark_scores = df_spark.select(explode(col("class.metadata")).alias("metadata")).select(col("metadata")["positive"].alias("positive"),
                                                                                    col("metadata")["neutral"].alias("neutral"),
                                                                                    col("metadata")["negative"].alias("negative"))

# df_spark_scores = df_spark_scores.withColumn('max_val', greatest('positive', 'negative', 'neutral')) # Doesn't work because of scientific notation

# Extract only targets and labels
df_spark = df_spark.select("text", "True_Sentiment", "class.result")


# # df_spark_no_text = df_spark.select("True_Sentiment", "result")
# # df_spark_no_text = df_spark_no_text.withColumn("Predicted_Sentiment", array_join("result", ""))

# Rename to Predicted Sentiment
df_spark = df_spark.withColumnRenamed("result","Predicted_Sentiment")

# Convert sentiment from a list to a string
df_spark = df_spark.withColumn("Predicted_Sentiment", array_join("Predicted_Sentiment", ""))

# Merge the predictions and the confidence scores

# Add temporary column index to join
w = Window.orderBy(monotonically_increasing_id())
df_spark_with_index =  df_spark.withColumn("columnindex", row_number().over(w))
df_spark_scores_with_index =  df_spark_scores.withColumn("columnindex", row_number().over(w))

# Join the predictions and the scores in one dataframe
df_spark_with_index = df_spark_with_index.join(df_spark_scores_with_index,
                         df_spark_with_index.columnindex == df_spark_scores_with_index.columnindex,
                         'inner').drop(df_spark_scores_with_index.columnindex)

# Remove the index column
df_spark_combined = df_spark_with_index.drop(df_spark_with_index.columnindex)

# Convert to pandas dataframe for postprocessing (https://towardsdatascience.com/text-classification-in-spark-nlp-with-bert-and-universal-sentence-encoders-e644d618ca32)
df_pandas_postprocessed = df_spark_combined.toPandas()
# df_pandas_postprocessed = df_spark.toPandas()

# df_pandas["Predicted_Sentiment"] = df_pandas["Predicted_Sentiment"].apply(lambda x: x[0]) # Alternative to convert list to string

end = time.time()

print(f"{end-start} seconds elapsed to classify {num_sentences} sentences.")

# df_pandas_post_processed


classifierdl_bertwiki_finance_sentiment_pipeline download started this may take some time.
Approx size to download 412.9 MB
[OK!]
26.504663228988647 seconds elapsed to classify 100 sentences.


### Compute the Accuracy

In [None]:
from sklearn.metrics import classification_report, accuracy_score

# Compute the accuracy
accuracy = accuracy_score(df_pandas["True_Sentiment"], df_pandas["Predicted_Sentiment"])
print(f"The accuracy is {accuracy*100}%.")
print(classification_report(df_pandas["True_Sentiment"], df_pandas["Predicted_Sentiment"]))

# Alternatively if not converted to pandas dataframe, use the following for the accuracy
# Compute accuracy by comparing each true label with predicted label
start = time.time()
accuracy = df_spark.filter(df_spark.Predicted_Sentiment == df_spark.True_Sentiment).count()/ num_sentences
end = time.time()
print(f"{end-start} seconds elapsed to calculate accuracy of {num_sentences} sentences.")
print(f"The accuracy is {accuracy*100}%.")

The accuracy is 90.0%.
              precision    recall  f1-score   support

    negative       1.00      0.93      0.97        15
     neutral       0.90      0.92      0.91        50
    positive       0.86      0.86      0.86        35

    accuracy                           0.90       100
   macro avg       0.92      0.90      0.91       100
weighted avg       0.90      0.90      0.90       100

28.48450493812561 seconds elapsed to calculate accuracy of 100 sentences.
The accuracy is 90.0%.


### Alternatively extract predictions as strings (takes much longer)

In [None]:
# # Extract the predictions from the dataframe
# annotations_list = result.select("class.result").collect()
# sentiment_list = [annotations_list[i].result[0] for i in range(num_sentences)]

# # Annotate previous dataframe for visualization
# df_pandas['Predicted Sentiment'] = sentiment_list

# # Move text column to the beginning
# text_column = df_pandas.pop('text')
# df_pandas.insert(0, 'Headline', text_column)

# display(df_pandas)

# y_true = df_pandas['True Sentiment'].to_numpy()
# y_pred = df_pandas['Predicted Sentiment'].to_numpy()

# print(f"The accuracy is {100* sum(y_true==y_pred)/len(y_true)}%. \n")