<a href="https://colab.research.google.com/github/Brand-Sentiment-Tracking/dev-sentiment-package/blob/main/johnsnow/Integrated_John_Snow.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Class for Sentiment Analysis for News Articles**

## Colab Setup

In [1]:
# Install PySpark and Spark NLP
! pip install -q pyspark==3.1.2 spark-nlp

# Install Spark NLP Display lib
! pip install --upgrade -q spark-nlp-display


In [2]:
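import random
import time

import pandas as pd
import sparknlp
from pyspark.ml import Pipeline
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType
import pyspark.sql.functions as F
from tabulate import tabulate
from sparknlp.annotator import *
from sparknlp.base import *
from sparknlp.pretrained import PretrainedPipeline
from sparknlp_display import NerVisualizer

# Names used inside SentimentIdentification.predict_dataframe
# (imported after the star imports so they are not shadowed)
from pyspark.sql.functions import (array_join, col, explode,
                                   monotonically_increasing_id, row_number)
from pyspark.sql.window import Window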

# spark = sparknlp.start(gpu=False)
spark = sparknlp.start(gpu=True)

print("Spark NLP version: ", sparknlp.version())
print("Apache Spark version: ", spark.version)

Spark NLP version:  3.4.2
Apache Spark version:  3.1.2


## Define a News Article

In [3]:
article = [ # two strings - headline & article body
"""Google sued in US over 'deceptive' location tracking""", # headline
"""Google is being sued in the US over accusations it deceived people about how to control location tracking.

The legal action refers to a widely reported 2018 revelation turning off one location-tracking setting in its apps was insufficient to fully disable the feature.

It accuses Google of using so-called dark patterns, marketing techniques that deliberately confuse.

Google said the claims were inaccurate and outdated.

'Unfair practices'
The legal action was filed in the District of Columbia. Similar ones were also filed in Texas, Indiana and Washington state.

It refers to an Associated Press revelation turning off Location History when using Google Maps or Search was insufficient - as a separate setting, Web and App Activity, continued to log location and other personal data.

The study, with researchers at Princeton University, found up to two billion Android and Apple devices could be affected.

"Google has relied on, and continues to rely on, deceptive and unfair practices that make it difficult for users to decline location tracking or to evaluate the data collection and processing to which they are purportedly consenting," the legal action alleges.

'Robust controls'
Google told BBC News the case was based "on inaccurate claims and outdated assertions about our settings".

A representative added: "We have always built privacy features into our products and provided robust controls for location data.

"We will vigorously defend ourselves and set the record straight."

Visual misdirection
The legal action claims Google's policies contained other "misleading, ambiguous and incomplete descriptions... but guarantee that consumers will not understand when their location is collected and retained by Google or for what purposes".

It refers to dark patterns, design choices that alter users' decision-making for the designer's benefit - such as, complicated navigation menus, visual misdirection, confusing wording and repeated nudging towards a particular outcome.

Data regulators are increasingly focusing on these practices.

Google faces a raft of other legal actions in the US, including:

In May 2020, Arizona filed a legal action over the same issue
In December 2020, multiple US states sued over the price and process of advertising auctions
In October 2020, the US Justice Department alleged Google had a monopoly over search and search advertising"""]



## Define the Brand Identification Class

In [4]:
def get_brand(row_list):
    if not row_list: # If the list is empty
        return "None"

    else:
        # Create a pandas df with entity names and types
        data = [[row.result, row.metadata['entity']] for row in row_list]
        df_pd = pd.DataFrame(data, columns = ['Entity', 'Type'])
      
        # Filter only ORGs
        df_pd = df_pd[df_pd["Type"] == "ORG"]

        # Rank the ORGs by frequencies
        ranked_df = df_pd["Entity"].value_counts() # a Pandas Series object
            
        # If no ORG is identified, return "None"
        if len(ranked_df.index) == 0:
            return "None"

        # If only one ORG appears, return it
        elif len(ranked_df.index) == 1:
            return ranked_df.index[0]

        # If one ORG appears more often than the others, return it
        # (.iloc is positional; value_counts() sorts by frequency, descending)
        elif ranked_df.iloc[0] > ranked_df.iloc[1]:
            return ranked_df.index[0]

        else:  # If the top ORGs appear equally often, pick one of the top two at random (TO BE MODIFIED)
            return random.choice([ranked_df.index[0], ranked_df.index[1]])
            # TODO: tie-break, e.g. using Wikidata for the article body
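
The random tie-break makes the output non-reproducible, and because Spark may re-run a UDF on every action, the same row can be assigned a different brand from one output to the next. One possible deterministic alternative (not the Wikidata approach noted in the TODO) is to prefer the tied ORG that is mentioned first. A minimal sketch, assuming plain `(text, entity_type)` tuples in place of Spark NLP annotations; `pick_brand` is a hypothetical name and the sample chunks are made up:

```python
from collections import Counter

def pick_brand(chunks):
    """Return the most frequent ORG; ties go to the ORG mentioned first.

    chunks: list of (text, entity_type) tuples in order of appearance.
    This stands in for the Spark NLP ner_chunk annotations (an assumption
    made to keep the sketch runnable without Spark).
    """
    orgs = [text for text, entity_type in chunks if entity_type == "ORG"]
    if not orgs:
        return "None"

    counts = Counter(orgs)
    top_count = max(counts.values())
    # orgs preserves the order of appearance, so the first hit wins the tie
    for org in orgs:
        if counts[org] == top_count:
            return org

# Made-up example: two ORGs mentioned once each, plus a location
chunks = [("Elcoteq", "ORG"), ("Tallinn", "LOC"), ("Postimees", "ORG")]
print(pick_brand(chunks))
```

Because the result depends only on the input, repeated evaluations of the same row agree with each other.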

In [5]:
class BrandIdentification:
    def __init__(self, MODEL_NAME):
        self.MODEL_NAME = MODEL_NAME

        # Define Spark NLP pipeline 
        documentAssembler = DocumentAssembler() \
            .setInputCol('text') \
            .setOutputCol('document')

        tokenizer = Tokenizer() \
            .setInputCols(['document']) \
            .setOutputCol('token')

        # ner_dl and onto_100 are trained with glove_100d, so the pipeline embeddings must match
        if self.MODEL_NAME in ("ner_dl", "onto_100"):
            embeddings = WordEmbeddingsModel.pretrained('glove_100d') \
                .setInputCols(['document', 'token']) \
                .setOutputCol('embeddings')

        # ner_dl_bert uses BERT embeddings
        elif self.MODEL_NAME == "ner_dl_bert":
            embeddings = BertEmbeddings.pretrained(name='bert_base_cased', lang='en') \
                .setInputCols(['document', 'token']) \
                .setOutputCol('embeddings')

        # Fail early instead of hitting an undefined `embeddings` below
        else:
            raise ValueError(f"Unsupported NER model: {self.MODEL_NAME}")

        ner_model = NerDLModel.pretrained(MODEL_NAME, 'en') \
            .setInputCols(['document', 'token', 'embeddings']) \
            .setOutputCol('ner')

        ner_converter = NerConverter() \
            .setInputCols(['document', 'token', 'ner']) \
            .setOutputCol('ner_chunk')

        nlp_pipeline = Pipeline(stages=[
            documentAssembler, 
            tokenizer,
            embeddings,
            ner_model,
            ner_converter
        ])
        
        # Create the pipeline model
        empty_df = spark.createDataFrame([['']]).toDF('text') # An empty df with column name "text"
        self.pipeline_model = nlp_pipeline.fit(empty_df)


    def predict_brand(self, text):
        """Predicts the brand mentioned in each text.

        Args:
          text: A pandas or Spark dataframe with a "text" column, a list of strings, or a single string.
        """
        # Convert the input to a Spark dataframe
        if isinstance(text, pd.DataFrame):
            text_df = spark.createDataFrame(text)
        elif isinstance(text, list):
            text_df = spark.createDataFrame(pd.DataFrame({'text': text}))
        elif isinstance(text, str):
            text_df = spark.createDataFrame(pd.DataFrame({'text': [text]}))
        else:
            text_df = text  # Assume the input is already a Spark dataframe

        df_spark = self.pipeline_model.transform(text_df) 

        # Improve speed of identification using Spark User-defined function
        pred_brand = F.udf(lambda z: get_brand(z), StringType()) # Output a string
        # spark.udf.register("pred_brand", pred_brand)

        df_spark_combined = df_spark.withColumn('Predicted_brand', pred_brand('ner_chunk'))
        df_spark_combined = df_spark_combined.select("text", "Predicted_brand")
        # df_spark_combined.show(100)
        
        # Remove all rows with no brands detected
        df_spark_final=df_spark_combined.filter(df_spark_combined.Predicted_brand != 'None')
        df_spark_final.show(100)

        return df_spark_final
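
`predict_brand` accepts several input types and reduces them to a single table with a `text` column. The dispatch can be illustrated without Spark. In this sketch `to_text_rows` is a hypothetical helper that returns a list of row dictionaries instead of a Spark dataframe, and, unlike the method above, rejects anything that is not a string or a list:

```python
def to_text_rows(text):
    """Normalize a string or a list of strings to rows with a "text" key."""
    if isinstance(text, str):
        # A bare string is treated as a single document
        return [{"text": text}]
    if isinstance(text, list):
        return [{"text": item} for item in text]
    raise TypeError(f"Unsupported input type: {type(text).__name__}")

print(to_text_rows("Bad news for Google"))
print(to_text_rows(["Bad news for Google", "Tesla went bankrupt today."]))
```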


## Define the Sentiment Identification Class

In [6]:
class SentimentIdentification:

    def __init__(self, MODEL_NAME):
        """Creates a sentiment identifier using the specified model.

        Args:
          MODEL_NAME: Name of the Spark NLP pretrained pipeline.
        """

        # Create the pipeline instance
        self.MODEL_NAME = MODEL_NAME

        # Create a custom pipeline if requested
        if self.MODEL_NAME == "custom_pipeline": # https://nlp.johnsnowlabs.com/2021/11/03/bert_sequence_classifier_finbert_en.html
            document_assembler = DocumentAssembler() \
                .setInputCol('text') \
                .setOutputCol('document')

            tokenizer = Tokenizer() \
                .setInputCols(['document']) \
                .setOutputCol('token')

            sequenceClassifier = BertForSequenceClassification \
                  .pretrained('bert_sequence_classifier_finbert', 'en') \
                  .setInputCols(['token', 'document']) \
                  .setOutputCol('class') \
                  .setCaseSensitive(True) \
                  .setMaxSentenceLength(512)

            pipeline = Pipeline(stages=[
                document_assembler,
                tokenizer,
                sequenceClassifier
            ])

            self.pipeline_model = pipeline.fit(spark.createDataFrame([['']]).toDF("text"))

        else:
            self.pipeline_model = PretrainedPipeline(self.MODEL_NAME, lang = 'en')


    def predict_string_list(self, string_list):
        """Predicts sentiment of the input list of strings.

        Args:
          string_list: List of strings to classify.
        """
 
        # Annotate input text using pretrained model

        if self.MODEL_NAME == "custom_pipeline":
            pipeline_annotator = LightPipeline(self.pipeline_model) # Convert the pipeline to an annotator
        else:
            pipeline_annotator = self.pipeline_model

        annotations =  pipeline_annotator.annotate(string_list)

        return [annotation['class'][0] for annotation in annotations] # Return the sentiment list of strings


    def predict_dataframe(self, df):
        """Annotates the input dataframe with the classification results.

        Args:
          df : Pandas or Spark dataframe to classify (must contain a "text" column)
        """

        if isinstance(df, pd.DataFrame):
            # Convert to spark dataframe for faster prediction
            df_spark = spark.createDataFrame(df) 
        else:
            df_spark = df

        # Annotate dataframe with classification results
        df_spark = self.pipeline_model.transform(df_spark)

        # Extract the sentiment scores
        df_spark_scores = df_spark \
            .select(explode(col("class.metadata")).alias("metadata")) \
            .select(col("metadata")["positive"].alias("positive"),
                    col("metadata")["neutral"].alias("neutral"),
                    col("metadata")["negative"].alias("negative"))

        # Extract only target and label columns
        df_spark = df_spark.select("text", "True_Sentiment", "class.result")

        # Rename the result column to Predicted_Sentiment
        df_spark = df_spark.withColumnRenamed("result", "Predicted_Sentiment")

        # Convert sentiment from a list to a string
        df_spark = df_spark.withColumn("Predicted_Sentiment", array_join("Predicted_Sentiment", ""))

        # Join the predictions dataframe to the scores dataframe
        # Add temporary column index to join
        w = Window.orderBy(monotonically_increasing_id())
        df_spark_with_index =  df_spark.withColumn("columnindex", row_number().over(w))
        df_spark_scores_with_index =  df_spark_scores.withColumn("columnindex", row_number().over(w))

        # Join the predictions and the scores in one dataframe
        df_spark_with_index = df_spark_with_index.join(df_spark_scores_with_index,
                                df_spark_with_index.columnindex == df_spark_scores_with_index.columnindex,
                                'inner').drop(df_spark_scores_with_index.columnindex)

        # Remove the index column
        df_spark_combined = df_spark_with_index.drop(df_spark_with_index.columnindex)

        # Convert to pandas dataframe for postprocessing (https://towardsdatascience.com/text-classification-in-spark-nlp-with-bert-and-universal-sentence-encoders-e644d618ca32)
        df_pandas_postprocessed = df_spark_combined.toPandas()

        return df_pandas_postprocessed


    def compute_accuracy(self, df_pandas_postprocessed):
        """Computes accuracy by comparing labels of input dataframe.

        Args:
          df_pandas_postprocessed: pandas dataframe containing "True_Sentiment" and "Predicted_Sentiment" columns
        """
    
        from sklearn.metrics import classification_report, accuracy_score

        # Compute the accuracy as a percentage
        accuracy = accuracy_score(df_pandas_postprocessed["True_Sentiment"], df_pandas_postprocessed["Predicted_Sentiment"])
        accuracy *= 100
        # Use a separate name so the imported function is not shadowed
        report = classification_report(df_pandas_postprocessed["True_Sentiment"], df_pandas_postprocessed["Predicted_Sentiment"])

        # Alternatively, if the input is a postprocessed Spark dataframe,
        # compute the accuracy by comparing each true label with the predicted label:
        # accuracy = df_spark.filter(df_spark.Predicted_Sentiment == df_spark.True_Sentiment).count() / num_sentences

        return accuracy, report
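
`predict_dataframe` reads the per-class scores from the `class.metadata` column. How a label follows from those scores can be sketched in plain Python. This assumes each metadata entry is a mapping from class name to a score stored as a string, possibly alongside other keys; `label_from_metadata` is a hypothetical helper, not part of Spark NLP:

```python
def label_from_metadata(metadata, labels=("positive", "neutral", "negative")):
    """Return (label, score) for the highest-scoring class.

    metadata: dict mapping class names to scores stored as strings
    (assumed layout of one entry of the class.metadata column).
    Keys other than the three class names are ignored.
    """
    scores = {label: float(metadata[label]) for label in labels}
    best = max(scores, key=scores.get)
    return best, scores[best]

# Scores copied from the last displayed row of the results table in this notebook
metadata = {"positive": "0.8207587", "neutral": "0.024473703", "negative": "0.15476753"}
print(label_from_metadata(metadata))
```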

## Identify Brand in news article


In [7]:
MODEL_NAME = "ner_dl_bert" # MODEL_NAME = "onto_100"
brand_identifier = BrandIdentification(MODEL_NAME)

bert_base_cased download started this may take some time.
Approximate size to download 389.1 MB
[OK!]
ner_dl_bert download started this may take some time.
Approximate size to download 15.4 MB
[OK!]


In [8]:
headline, body = article

brand_by_headline = brand_identifier.predict_brand(headline)

# Only use article body if no brand identified in the headline
if brand_by_headline.count() == 0:
    brand_by_body = brand_identifier.predict_brand(body)

+--------------------+---------------+
|                text|Predicted_brand|
+--------------------+---------------+
|Google sued in US...|         Google|
+--------------------+---------------+
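
The headline-first, body-as-fallback rule used in the cell above can be written as a small function. This is a sketch only: `identify_brand` and `toy_predict` are hypothetical names, and `toy_predict` is a keyword lookup standing in for the NER model so that the example runs without Spark:

```python
from typing import Callable, Optional

def identify_brand(headline: str, body: str,
                   predict: Callable[[str], Optional[str]]) -> Optional[str]:
    """Try the headline first; fall back to the body if no brand is found."""
    brand = predict(headline)
    if brand is None:
        brand = predict(body)
    return brand

def toy_predict(text: str) -> Optional[str]:
    # Hypothetical stand-in for the NER model: looks for known brand names
    for name in ("Google", "Tesla"):
        if name in text:
            return name
    return None

print(identify_brand("Tech giant sued over tracking",
                     "Google is being sued in the US.", toy_predict))
```

Returning a single value from the function keeps the headline and body results from ending up in two separate variables.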



In [9]:
list_of_hl = ["Bad news for Google", "Tesla went bankrupt today."]
brands = brand_identifier.predict_brand(list_of_hl)

+--------------------+---------------+
|                text|Predicted_brand|
+--------------------+---------------+
| Bad news for Google|         Google|
|Tesla went bankru...|          Tesla|
+--------------------+---------------+



## Classify article using chosen pipeline

In [10]:
# identifier = SentimentIdentification(MODEL_NAME =  "analyze_sentimentdl_glove_imdb")
# identifier = SentimentIdentification(MODEL_NAME =  "classifierdl_bertwiki_finance_sentiment_pipeline")
identifier = SentimentIdentification(MODEL_NAME = "custom_pipeline") # Uses https://nlp.johnsnowlabs.com/2021/11/03/bert_sequence_classifier_finbert_en.html

# identifier_pretrained = SentimentIdentification(MODEL_NAME = "classifierdl_bertwiki_finance_sentiment_pipeline")
identifier_pretrained = SentimentIdentification(MODEL_NAME = "custom_pipeline")

identifier_pretrained.predict_string_list([headline, body])


bert_sequence_classifier_finbert download started this may take some time.
Approximate size to download 390.9 MB
[OK!]
bert_sequence_classifier_finbert download started this may take some time.
Approximate size to download 390.9 MB
[OK!]


['negative', 'negative']

## Test the accuracy of brand and sentiment identification using the Financial News Headline Dataset

## NER 

### Convert Kaggle data to Pandas dataframe and preprocess

In [11]:
# Load the data from GitHub
NER_url = 'https://raw.githubusercontent.com/Brand-Sentiment-Tracking/python-package/main/data/NER_test_data.csv'

# Convert csv data to Pandas dataframe 
df_NER = pd.read_csv(NER_url, header=None).head(500) # 'header=None' prevents pandas eating the first row as headers
df_NER.columns = ['Brand', 'text']

# Shuffle the DataFrame rows
# df_NER = df_NER.sample(frac = 1)

# Make dataset smaller for faster runtime
num_sentences = 10
total_num_sentences = df_NER.shape[0]
df_NER.drop(df_NER.index[num_sentences:total_num_sentences], inplace=True)



# Alternatively, create a preprocessed spark dataframe from csv
from pyspark import SparkFiles
spark.sparkContext.addFile(NER_url)

# Read raw dataframe
df_spark_org = spark.read.csv("file://"+SparkFiles.get("NER_test_data.csv"))

# Rename columns
df_spark_org = df_spark_org.withColumnRenamed("_c0", "Brand").withColumnRenamed("_c1", "text")
df_spark_org = df_spark_org.limit(num_sentences)
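
The loading step can also be tried offline. The sketch below uses the standard-library `csv` module rather than pandas or Spark to parse header-less, two-column text and keep the first few rows. `load_labelled_rows` is a hypothetical helper and the sample rows are made up to match the label-then-text layout:

```python
import csv
import io

def load_labelled_rows(csv_text, columns, limit):
    """Parse header-less CSV text into dicts and keep the first `limit` rows."""
    reader = csv.reader(io.StringIO(csv_text))
    rows = [dict(zip(columns, record)) for record in reader]
    return rows[:limit]

# Made-up sample in the same two-column layout (label first, then text)
sample = (
    'Google,"Google sued in US over location tracking"\n'
    'None,"Operating profit rose, the company said"\n'
    'Tesla,"Tesla went bankrupt today."\n'
)
rows = load_labelled_rows(sample, columns=["Brand", "text"], limit=2)
print(rows)
```

Quoted fields keep commas inside a sentence from being split into extra columns.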

### Identify the brand in each sentence & compute accuracy



In [12]:
MODEL_NAME = "ner_dl_bert" # MODEL_NAME = "onto_100" / "ner_dl"
brand_identifier = BrandIdentification(MODEL_NAME)

bert_base_cased download started this may take some time.
Approximate size to download 389.1 MB
[OK!]
ner_dl_bert download started this may take some time.
Approximate size to download 15.4 MB
[OK!]


### Identify all brands using pandas and Spark dataframes of sentences as input

In [13]:
brand_identifier.predict_brand(df_NER)
brand_identifier.predict_brand(df_spark_org)

+--------------------+---------------+
|                text|Predicted_brand|
+--------------------+---------------+
|According to Gran...|           Gran|
|Technopolis plans...|    Technopolis|
|The international...|      Postimees|
|According to the ...|        Basware|
|FINANCING OF ASPO...|       ASPOCOMP|
|For the last quar...|     Componenta|
|In the third quar...|            EUR|
+--------------------+---------------+

+--------------------+---------------+
|                text|Predicted_brand|
+--------------------+---------------+
|According to Gran...|           Gran|
|Technopolis plans...|    Technopolis|
|The international...|      Postimees|
|According to the ...|        Basware|
|FINANCING OF ASPO...|GROWTH Aspocomp|
|For the last quar...|     Componenta|
|In the third quar...|            EUR|
+--------------------+---------------+



DataFrame[text: string, Predicted_brand: string]

In [14]:
# Improve speed of identification using Spark User-defined function
dataframe_type = "Spark"

# If the input is a pandas dataframe, convert it to a Spark dataframe first
if dataframe_type == "Pandas":
    df_spark_org = spark.createDataFrame(df_NER)
df_spark = brand_identifier.pipeline_model.transform(df_spark_org)

start = time.time()

# Wrap get_brand (defined earlier in this notebook) in a Spark UDF that outputs a string
pred_brand = F.udf(get_brand, StringType())
# spark.udf.register("pred_brand", pred_brand)

# pred_brand_col = pred_brand(df_spark.ner_chunk)
df_spark_combined = df_spark.withColumn('Predicted Brand', pred_brand('ner_chunk'))
df_spark_final = df_spark_combined.select("Brand", "Predicted Brand")
df_spark_final.show(100)

end = time.time()

print(f"{end-start} seconds elapsed to create ranked tables for {num_sentences} sentences.")

+------------+---------------+
|       Brand|Predicted Brand|
+------------+---------------+
|        None|           Gran|
|Technopolis |    Technopolis|
|     Elcoteq|        Elcoteq|
|        None|           None|
|     Basware|        Basware|
|    Aspocomp|       ASPOCOMP|
|  Componenta|     Componenta|
|        None|            EUR|
|        None|           None|
|        None|           None|
+------------+---------------+

1.8998210430145264 seconds elapsed to create ranked tables for 10 sentences.


In [15]:
# Compute the accuracy
df_pd_post = df_spark_final.toPandas()

y_true = df_pd_post['Brand'].to_numpy()
y_pred = df_pd_post['Predicted Brand'].to_numpy()
print(f"The accuracy is {100*sum(y_true==y_pred)/len(y_true)}%. \n")

The accuracy is 50.0%. 
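
Exact string comparison counts near-misses as errors, for example `Aspocomp` vs `ASPOCOMP`, or a label that appears to carry a trailing space (`Technopolis `, judging by the table alignment). A minimal sketch of a more forgiving comparison follows. `brand_accuracy` is a hypothetical helper, and whether case-insensitive matching is appropriate for brand names is an assumption to check against the data:

```python
def brand_accuracy(y_true, y_pred, normalize=True):
    """Percentage of rows where the predicted brand matches the label."""
    def clean(name):
        # Ignore surrounding whitespace and letter case when requested
        return name.strip().casefold() if normalize else name

    matches = sum(clean(t) == clean(p) for t, p in zip(y_true, y_pred))
    return 100 * matches / len(y_true)

# A few label/prediction pairs in the style of the table above
y_true = ["Technopolis ", "Aspocomp", "Basware", "None"]
y_pred = ["Technopolis", "ASPOCOMP", "Basware", "EUR"]

print(brand_accuracy(y_true, y_pred, normalize=False))  # exact string match
print(brand_accuracy(y_true, y_pred))                   # whitespace- and case-insensitive
```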



## Sentiment

### Load Sentiment Test data

In [37]:
# Convert Kaggle data to Pandas dataframe and preprocess
import time

sentiment_url = 'https://raw.githubusercontent.com/Brand-Sentiment-Tracking/python-package/main/data/sentiment_test_data.csv'

# Store data in a Pandas Dataframe
df_pandas = pd.read_csv(sentiment_url, header=None)

# Change column names (pipelines require a "text" column to predict)
df_pandas.columns = ['True_Sentiment', 'text']

# Shuffle the DataFrame rows
# df_pandas = df_pandas.sample(frac = 1)

# Make dataset smaller for faster runtime
num_sentences = 30
total_num_sentences = df_pandas.shape[0]
df_pandas.drop(df_pandas.index[num_sentences:total_num_sentences], inplace=True)

print(df_pandas.shape)

# Create a preprocessed spark dataframe
from pyspark import SparkFiles
spark.sparkContext.addFile(sentiment_url)

# Read raw dataframe
df_spark = spark.read.csv("file://"+SparkFiles.get("sentiment_test_data.csv"))

# Rename columns
df_spark = df_spark.withColumnRenamed("_c0", "True_Sentiment").withColumnRenamed("_c1", "text")
df_spark = df_spark.limit(num_sentences)

(30, 2)


### Classify using Pandas Dataframe as input

In [38]:
from pyspark.sql.functions import array_join
from pyspark.sql.functions import col, explode, expr, greatest
from pyspark.sql.window import Window
from pyspark.sql.functions import monotonically_increasing_id, row_number

# Create identifier
identifier_pretrained = SentimentIdentification(MODEL_NAME = "classifierdl_bertwiki_finance_sentiment_pipeline")
# identifier_pretrained = SentimentIdentification(MODEL_NAME = "custom_pipeline")

start = time.time()
df_pandas_postprocessed = identifier_pretrained.predict_dataframe(df_pandas)
end = time.time()

print(f"{end-start} seconds elapsed to classify {num_sentences} sentences.")

display(df_pandas_postprocessed)

# Print accuracy metrics
accuracy, report = identifier_pretrained.compute_accuracy(df_pandas_postprocessed)
print(accuracy)
print(report)

classifierdl_bertwiki_finance_sentiment_pipeline download started this may take some time.
Approx size to download 412.9 MB
[OK!]
2.2672581672668457 seconds elapsed to classify 30 sentences.


| | text | True_Sentiment | Predicted_Sentiment | positive | neutral | negative |
|---|---|---|---|---|---|---|
| 0 | According to Gran , the company has no plans t... | neutral | neutral | 0.0002473711 | 0.999752 | 5.7069025e-07 |
| 1 | Technopolis plans to develop in stages an area... | neutral | neutral | 5.652343e-09 | 1.0 | 7.506982e-10 |
| 2 | The international electronic industry company ... | negative | negative | 6.0032686e-05 | 1.2689318e-05 | 0.9999273 |
| 3 | With the new production plant the company woul... | positive | positive | 0.99999917 | 5.6210104e-07 | 2.830157e-07 |
| 4 | According to the company 's updated strategy f... | positive | positive | 0.99997413 | 2.5439595e-05 | 4.7391094e-07 |
| 5 | FINANCING OF ASPOCOMP 'S GROWTH Aspocomp is ag... | positive | positive | 0.99249285 | 0.007490357 | 1.6831207e-05 |
| 6 | For the last quarter of 2010 , Componenta 's n... | positive | positive | 0.99999535 | 9.2654466e-07 | 3.7480138e-06 |
| 7 | In the third quarter of 2010 , net sales incre... | positive | positive | 0.99799263 | 4.4438417e-05 | 0.0019629349 |
| 8 | Operating profit rose to EUR 13.1 mn from EUR ... | positive | positive | 0.99989295 | 3.2884018e-06 | 0.000103695835 |
| 9 | Operating profit totalled EUR 21.1 mn , up fro... | positive | positive | 0.8207587 | 0.024473703 | 0.15476753 |


93.33333333333333
              precision    recall  f1-score   support

    negative       1.00      1.00      1.00         1
     neutral       0.50      1.00      0.67         2
    positive       1.00      0.93      0.96        27

    accuracy                           0.93        30
   macro avg       0.83      0.98      0.88        30
weighted avg       0.97      0.93      0.94        30



### Classify using Spark Dataframe as input

In [39]:
# Create identifier
identifier_pretrained = SentimentIdentification(MODEL_NAME = "classifierdl_bertwiki_finance_sentiment_pipeline")
# identifier_pretrained = SentimentIdentification(MODEL_NAME = "custom_pipeline")

start = time.time()
# df_pandas_postprocessed = identifier_pretrained.predict_sp_dataframe(df_spark)
df_pandas_postprocessed = identifier_pretrained.predict_dataframe(df_spark)
end = time.time()

print(f"{end-start} seconds elapsed to classify {num_sentences} sentences.")

display(df_pandas_postprocessed)

classifierdl_bertwiki_finance_sentiment_pipeline download started this may take some time.
Approx size to download 412.9 MB
[OK!]
2.2867414951324463 seconds elapsed to classify 30 sentences.


| | text | True_Sentiment | Predicted_Sentiment | positive | neutral | negative |
|---|---|---|---|---|---|---|
| 0 | According to Gran , the company has no plans t... | neutral | neutral | 0.0002473711 | 0.999752 | 5.7069025e-07 |
| 1 | Technopolis plans to develop in stages an area... | neutral | neutral | 5.652343e-09 | 1.0 | 7.506982e-10 |
| 2 | The international electronic industry company ... | negative | negative | 6.0032686e-05 | 1.2689318e-05 | 0.9999273 |
| 3 | With the new production plant the company woul... | positive | positive | 0.99999917 | 5.6210104e-07 | 2.830157e-07 |
| 4 | According to the company 's updated strategy f... | positive | positive | 0.99997413 | 2.5439595e-05 | 4.7391094e-07 |
| 5 | FINANCING OF ASPOCOMP 'S GROWTH Aspocomp is ag... | positive | positive | 0.99249285 | 0.007490357 | 1.6831207e-05 |
| 6 | For the last quarter of 2010 , Componenta 's n... | positive | positive | 0.99999535 | 9.2654466e-07 | 3.7480138e-06 |
| 7 | In the third quarter of 2010 , net sales incre... | positive | positive | 0.99799263 | 4.4438417e-05 | 0.0019629349 |
| 8 | Operating profit rose to EUR 13.1 mn from EUR ... | positive | positive | 0.99989295 | 3.2884018e-06 | 0.000103695835 |
| 9 | Operating profit totalled EUR 21.1 mn , up fro... | positive | positive | 0.8207587 | 0.024473703 | 0.15476753 |


### Identify the sentiment in each sentence one by one

In [36]:
# Create the identifier object
# identifier = SentimentIdentification(MODEL_NAME = "custom_pipeline") # 90.2% accuracy on 500 sentences, 89.8% on 1000 sentences
identifier = SentimentIdentification(MODEL_NAME = "classifierdl_bertwiki_finance_sentiment_pipeline") # Alternative pretrained pipeline: 90.0% accuracy on 500 sentences



preds = []
target = []
ignored_idxs = []
sentiment_to_ignore = "" # e.g. neutral

# Measure how long it takes
start = time.time()

# Collect the predicted sentiment for each headline - takes about three minutes to run
for idx, hl in enumerate(df_pandas['text']):

    # Only append the sentiment if it is not the sentiment to ignore (e.g. neutral)
    target_sentiment = df_pandas["True_Sentiment"].iloc[idx]

    if target_sentiment != sentiment_to_ignore:
        preds.append(identifier.predict_string_list([hl])[0])
    else:
        ignored_idxs.append(idx)

    # Print progress
    if idx % 25 == 0:
        print(f"Classification {100*idx/num_sentences}% done.")

# Remove all ignored entries from dataset
df_pandas.drop(df_pandas.index[ignored_idxs], inplace=True)

df_pandas['Predicted_Sentiment'] = preds

# Measure how long it takes
end = time.time()
print(f"{end-start} seconds elapsed to classify {num_sentences} sentences.")

# Modify predicted labels to match with true labels
# df = df.replace({'Predicted Sentiment': {'pos' : 'positive', 'neg' : 'negative'}})

df_pandas

classifierdl_bertwiki_finance_sentiment_pipeline download started this may take some time.
Approx size to download 412.9 MB
[OK!]
Classification 0.0% done.
0.7695424556732178 seconds elapsed to classify 10 sentences.


| | True_Sentiment | text | Predicted_Sentiment |
|---|---|---|---|
| 0 | neutral | According to Gran , the company has no plans t... | neutral |
| 1 | neutral | Technopolis plans to develop in stages an area... | neutral |
| 2 | negative | The international electronic industry company ... | negative |
| 3 | positive | With the new production plant the company woul... | positive |
| 4 | positive | According to the company 's updated strategy f... | positive |
| 5 | positive | FINANCING OF ASPOCOMP 'S GROWTH Aspocomp is ag... | positive |
| 6 | positive | For the last quarter of 2010 , Componenta 's n... | positive |
| 7 | positive | In the third quarter of 2010 , net sales incre... | positive |
| 8 | positive | Operating profit rose to EUR 13.1 mn from EUR ... | positive |
| 9 | positive | Operating profit totalled EUR 21.1 mn , up fro... | positive |


### Measure the Accuracy

In [40]:
from sklearn.metrics import classification_report

y_true = df_pandas_postprocessed['True_Sentiment'].to_numpy()
y_pred = df_pandas_postprocessed['Predicted_Sentiment'].to_numpy()


print(f"The accuracy is {100 * sum(y_true == y_pred) / len(y_true)}%. \n")

# Compute per-class precision, recall and F1
# (labels are reported in alphabetical order: negative, neutral, positive)
print(classification_report(y_true, y_pred))

The accuracy is 93.33333333333333%. 

              precision    recall  f1-score   support

    negative       1.00      1.00      1.00         1
     neutral       0.50      1.00      0.67         2
    positive       1.00      0.93      0.96        27

    accuracy                           0.93        30
   macro avg       0.83      0.98      0.88        30
weighted avg       0.97      0.93      0.94        30
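
The figures in the report can be reproduced by hand. The sketch below computes per-class precision and recall in plain Python. The label lists are reconstructed from the report's support, precision and recall values (two positive sentences predicted as neutral), so they are an inference rather than the raw predictions, and `per_class_metrics` is a hypothetical helper:

```python
def per_class_metrics(y_true, y_pred):
    """Return {label: (precision, recall)} computed from two label lists."""
    metrics = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        predicted = sum(p == label for p in y_pred)  # TP + FP
        actual = sum(t == label for t in y_true)     # TP + FN
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        metrics[label] = (precision, recall)
    return metrics

# Reconstructed from the report: 1 negative, 2 neutral and 27 positive
# sentences, with 2 positives predicted as neutral (an inference, not raw data)
y_true = ["negative"] + ["neutral"] * 2 + ["positive"] * 27
y_pred = ["negative"] + ["neutral"] * 2 + ["positive"] * 25 + ["neutral"] * 2

accuracy = 100 * sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(round(accuracy, 2))
for label, (precision, recall) in per_class_metrics(y_true, y_pred).items():
    print(f"{label}: precision={precision:.2f} recall={recall:.2f}")
```

With only two neutral sentences in the sample, two misclassified positives are enough to halve the neutral precision, so the per-class figures here should be read with that small support in mind.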

