<a href="https://colab.research.google.com/github/Brand-Sentiment-Tracking/dev-sentiment-package/blob/main/johnsnow/Train_sentiment_classifier_with_our_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

# Spark NLP
### Multi-class Sentiment Classification
#### By using SentimentDL

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/training/english/classification/SentimentDL_train_multiclass_sentiment_classifier.ipynb)

Only run this cell if you are inside Google Colab; otherwise, skip it.

In [21]:
# This is only to setup PySpark and Spark NLP on Colab
!wget http://setup.johnsnowlabs.com/colab.sh -O - | bash

--2022-04-26 15:21:38--  http://setup.johnsnowlabs.com/colab.sh
Resolving setup.johnsnowlabs.com (setup.johnsnowlabs.com)... 51.158.130.125
Connecting to setup.johnsnowlabs.com (setup.johnsnowlabs.com)|51.158.130.125|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://setup.johnsnowlabs.com/colab.sh [following]
--2022-04-26 15:21:38--  https://setup.johnsnowlabs.com/colab.sh
Connecting to setup.johnsnowlabs.com (setup.johnsnowlabs.com)|51.158.130.125|:443... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/scripts/colab_setup.sh [following]
--2022-04-26 15:21:39--  https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/scripts/colab_setup.sh
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:44

In [22]:
import sparknlp

spark = sparknlp.start(gpu=True)

print("Spark NLP version: ", sparknlp.version())
print("Apache Spark version", spark.version)

Spark NLP version:  3.4.3
Apache Spark version 3.0.3


### Load the labelled dataset

In [23]:
# Convert Kaggle data to Pandas dataframe and preprocess
import pandas as pd
from pyspark.sql.functions import rand 

sentiment_url_1 = 'https://raw.githubusercontent.com/Brand-Sentiment-Tracking/python-package/main/data/sentiment_test_data.csv' # Financial Phrase Bank
sentiment_url_2 = 'https://raw.githubusercontent.com/Brand-Sentiment-Tracking/python-package/main/data/SEN_en_R.csv' # SEN data
sentiment_url_3 = 'https://raw.githubusercontent.com/Brand-Sentiment-Tracking/dev-sentiment-package/main/data/labelled_1.csv' # our extracted data



# # Create a preprocessed spark dataframe
# from pyspark import SparkFiles
# spark.sparkContext.addFile(sentiment_url_3)

# # Read financial dataframe
# df_spark = spark.read.option("header","false").csv("file://"+SparkFiles.get("sentiment_test_data.csv"))
# df_spark = df_spark.withColumnRenamed("_c0", "label").withColumnRenamed("_c1", "text")
# # Shuffle dataset
# df_spark = df_spark.orderBy(rand(2))

# # Read political dataframe
# df_spark = spark.read.option("header","true").csv("file://"+SparkFiles.get("SEN_en_R.csv"))
# df_spark = df_spark.withColumnRenamed("_c0", "idx").withColumnRenamed("headline", "text") \
#                                                  .withColumnRenamed("majority_label", 'label')
# # Replace abbreviations with full labels and filter the unk datapoints (applies only to political dataset)
# df_spark = df_spark.replace("neutr", "neutral").replace("pos", "positive").replace("neg","negative")
# df_spark = df_spark[df_spark.label != "unk"]

# # Read our own dataframe

cols_to_read = ['text',"sentiment (Max's take)"]
df_pandas = pd.read_csv(sentiment_url_3, usecols=cols_to_read)

# Rename sentiment to True_Sentiment
df_pandas.rename(columns={"sentiment (Max's take)":"True_Sentiment"},inplace=True)

# Keep only the first 500 rows for a faster runtime
num_sentences = 500
df_pandas = df_pandas.head(num_sentences).copy()

# Replace 1, 2, 3 with negative, neutral, positive
# (assign the result back instead of using inplace=True on a column selection)
df_pandas["True_Sentiment"] = df_pandas["True_Sentiment"].replace({1.0: "negative", 2.0: "neutral", 3.0: "positive"})

# Downsample each class: keep at most 50 negative, 100 neutral and 50 positive rows
negative_points = df_pandas[df_pandas["True_Sentiment"] == 'negative'].head(50)
neutral_points = df_pandas[df_pandas["True_Sentiment"] == 'neutral'].head(100)
positive_points = df_pandas[df_pandas["True_Sentiment"] == 'positive'].head(50)

print(len(neutral_points))
print(len(positive_points))
print(len(negative_points))

# Concatenate
df_pandas = pd.concat([negative_points, neutral_points, positive_points])

# Shuffle
df_pandas = df_pandas.sample(frac=1)

# # Convert to spark dataframe  
df_spark = spark.createDataFrame(df_pandas)

df_spark.show()


100
50
50
+--------------------+--------------+
|                text|True_Sentiment|
+--------------------+--------------+
|Slovak former pre...|      negative|
|Liberty Latin Ame...|      positive|
|After-school ‘Sat...|       neutral|
|CPI figures highl...|       neutral|
|Veeva Vault CTMS ...|      positive|
|Shock as Linda No...|      negative|
|Putin Calls off P...|       neutral|
|BGBS 2022 saw hig...|      positive|
|Investigation lau...|       neutral|
|Remembering Bedfo...|       neutral|
|CA WFO LOS ANGELE...|       neutral|
|Phoenix New Media...|      negative|
|Hong Kong reopens...|      positive|
|‘Corpse After Cor...|      negative|
|What makes Alabam...|      positive|
|Israeli settlers ...|       neutral|
|United Airlines l...|       neutral|
|Ambulance crews c...|      negative|
|Welltower to Part...|       neutral|
|Insteel Industrie...|       neutral|
+--------------------+--------------+
only showing top 20 rows
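
The label mapping and per-class downsampling above can also be written as one loop over a dictionary of caps. The following is a minimal sketch on a hypothetical toy frame (the `toy` data and the cap values are made up for illustration; the notebook itself uses 50/100/50 on `labelled_1.csv`). It also passes `random_state` to `sample`, which is an assumption added here to make the shuffle reproducible.

```python
import pandas as pd

# Hypothetical toy data standing in for labelled_1.csv (numeric labels 1/2/3).
toy = pd.DataFrame({
    "text": [f"headline {i}" for i in range(12)],
    "True_Sentiment": [1.0, 2.0, 2.0, 3.0, 2.0, 2.0, 1.0, 2.0, 3.0, 2.0, 1.0, 3.0],
})

# Map numeric codes to string labels by assigning the result back to the column.
label_names = {1.0: "negative", 2.0: "neutral", 3.0: "positive"}
toy["True_Sentiment"] = toy["True_Sentiment"].map(label_names)

# Per-class caps (illustrative values only).
caps = {"negative": 2, "neutral": 4, "positive": 2}
balanced = pd.concat(
    [toy[toy["True_Sentiment"] == label].head(n) for label, n in caps.items()]
)

# Shuffle with a fixed seed so the row order is the same on every run.
balanced = balanced.sample(frac=1, random_state=42).reset_index(drop=True)

class_counts = balanced["True_Sentiment"].value_counts().to_dict()
print(class_counts)
```

Keeping the caps in one dictionary means the class balance can be changed in a single place.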



In [24]:
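# Split into train and test (the first 80% of the shuffled rows are used for training).
# Note: subtract() is a set difference (EXCEPT DISTINCT), so duplicate rows and
# test rows that also occur in the training set are dropped from the test set.
df_spark_train = df_spark.limit(round(df_spark.count()*0.8))
df_spark_test = df_spark.subtract(df_spark_train)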

df_spark_train.show()

+--------------------+--------------+
|                text|True_Sentiment|
+--------------------+--------------+
|Slovak former pre...|      negative|
|Liberty Latin Ame...|      positive|
|After-school ‘Sat...|       neutral|
|CPI figures highl...|       neutral|
|Veeva Vault CTMS ...|      positive|
|Shock as Linda No...|      negative|
|Putin Calls off P...|       neutral|
|BGBS 2022 saw hig...|      positive|
|Investigation lau...|       neutral|
|Remembering Bedfo...|       neutral|
|CA WFO LOS ANGELE...|       neutral|
|Phoenix New Media...|      negative|
|Hong Kong reopens...|      positive|
|‘Corpse After Cor...|      negative|
|What makes Alabam...|      positive|
|Israeli settlers ...|       neutral|
|United Airlines l...|       neutral|
|Ambulance crews c...|      negative|
|Welltower to Part...|       neutral|
|Insteel Industrie...|       neutral|
+--------------------+--------------+
only showing top 20 rows
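
One caveat about the split above: `subtract` compares whole rows and returns distinct rows only, so the test set can end up smaller than the expected 20%. That may be why the classification report further down lists 35 test rows rather than 40, although this notebook does not confirm the cause. A minimal sketch of an index-based split in pandas, done before converting to Spark, avoids this; the `toy` frame and the `random_state` value are hypothetical. PySpark's `DataFrame.randomSplit` is a Spark-native alternative, but it only returns approximately the requested proportions.

```python
import pandas as pd

# Hypothetical toy frame; rows 8 and 9 deliberately repeat rows 0 and 1.
toy = pd.DataFrame({
    "text": [f"headline {i % 8}" for i in range(10)],
    "True_Sentiment": ["neutral"] * 10,
})

# Sample 80% of the rows by index, then keep the remaining indices as the test set.
train_pd = toy.sample(frac=0.8, random_state=0)
test_pd = toy.drop(train_pd.index)

# Duplicated texts do not matter here because rows are identified by index.
print(len(train_pd), len(test_pd))

# Each part could then be converted on its own, e.g. spark.createDataFrame(train_pd).
```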



The content is inside the `text` column and the sentiment is inside the `True_Sentiment` column.

In [25]:
trainDataset = df_spark_train
trainDataset.show()

+--------------------+--------------+
|                text|True_Sentiment|
+--------------------+--------------+
|Slovak former pre...|      negative|
|Liberty Latin Ame...|      positive|
|After-school ‘Sat...|       neutral|
|CPI figures highl...|       neutral|
|Veeva Vault CTMS ...|      positive|
|Shock as Linda No...|      negative|
|Putin Calls off P...|       neutral|
|BGBS 2022 saw hig...|      positive|
|Investigation lau...|       neutral|
|Remembering Bedfo...|       neutral|
|CA WFO LOS ANGELE...|       neutral|
|Phoenix New Media...|      negative|
|Hong Kong reopens...|      positive|
|‘Corpse After Cor...|      negative|
|What makes Alabam...|      positive|
|Israeli settlers ...|       neutral|
|United Airlines l...|       neutral|
|Ambulance crews c...|      negative|
|Welltower to Part...|       neutral|
|Insteel Industrie...|       neutral|
+--------------------+--------------+
only showing top 20 rows



In [26]:
trainDataset.count()

160

In [27]:
from pyspark.ml import Pipeline

from sparknlp.annotator import *
from sparknlp.common import *
from sparknlp.base import *

In [28]:
# Alternative architecture

document = DocumentAssembler()\
      .setInputCol("text")\
      .setOutputCol("document")
    
bert_cmlm = BertSentenceEmbeddings.pretrained('sent_bert_use_cmlm_en_base', 'en')\
      .setInputCols(["document"])\
      .setOutputCol("sentence_embeddings")

classifierdl = ClassifierDLApproach()\
      .setInputCols(["sentence_embeddings"])\
      .setOutputCol("class")\
      .setLabelColumn("True_Sentiment")\
      .setMaxEpochs(10)\
      .setEnableOutputLogs(True)\
      .setLr(0.005)\
      .setBatchSize(20)

bert_cmlm_clf_pipeline = Pipeline(
    stages = [
        document,
        bert_cmlm,
        classifierdl
    ])

sent_bert_use_cmlm_en_base download started this may take some time.
Approximate size to download 391.6 MB
[OK!]


In [29]:
# # actual content is inside the text column
# document = DocumentAssembler()\
#     .setInputCol("text")\
#     .setOutputCol("document")

# use = UniversalSentenceEncoder.pretrained() \
#  .setInputCols(["document"])\
#  .setOutputCol("sentence_embeddings")


# classsifierdl = ClassifierDLApproach()\
#   .setInputCols(["sentence_embeddings"])\
#   .setOutputCol("class")\
#   .setLabelColumn("True_Sentiment")\
#   .setMaxEpochs(5)\
#   .setEnableOutputLogs(True)

# pipeline = Pipeline(
#     stages = [
#         document,
#         use,
#         classsifierdl
#     ])

In [30]:
# Fit to the training dataset (train the model)
pipelineModel = bert_cmlm_clf_pipeline.fit(trainDataset)

# How to use already trained SentimentDL pipeline or its model

There are two ways to reuse what we trained: the whole fitted pipeline, or only the trained classifier model (the last stage of the pipeline).

Let's see how we can save the entire pipeline, load it, and make predictions with that pre-trained pipeline.

## Save and load pre-trained SentimentDL pipeline

In [31]:
# The free Google Colab runtime has limited memory, so saving and reloading the
# whole pipeline is skipped in this notebook. You can do it locally or on a larger machine:

# from pyspark.ml import PipelineModel
# pipelineModel.save("./sentimentdl_pipeline")
# loadedPipeline = PipelineModel.load("./sentimentdl_pipeline")
# loadedPipeline.transform(YOUR_DATAFRAME)

## Save and load pre-trained SentimentDL model

In [32]:
# hdfs:/ if you are saving it on distributed file systems in Hadoop
pipelineModel.stages[-1].write().overwrite().save('./tmp_sentimentdl_model')



Let's use our saved `ClassifierDLModel` in a pipeline:

In [33]:
# Alternative architecture

document = DocumentAssembler()\
      .setInputCol("text")\
      .setOutputCol("document")
    
bert_cmlm = BertSentenceEmbeddings.pretrained('sent_bert_use_cmlm_en_base', 'en')\
      .setInputCols(["document"])\
      .setOutputCol("sentence_embeddings")

classifierdl = ClassifierDLModel.load("./tmp_sentimentdl_model") \
  .setInputCols(["sentence_embeddings"])\
  .setOutputCol("class")

bert_cmlm_clf_pipeline = Pipeline(
    stages = [
        document,
        bert_cmlm,
        classifierdl
    ])

sent_bert_use_cmlm_en_base download started this may take some time.
Approximate size to download 391.6 MB
[OK!]


In [34]:
# # In a new pipeline you can load it for prediction
# document = DocumentAssembler()\
#     .setInputCol("text")\
#     .setOutputCol("document")

# use = UniversalSentenceEncoder.pretrained() \
#  .setInputCols(["document"])\
#  .setOutputCol("sentence_embeddings")

# classifierdl = ClassifierDLModel.load("./tmp_sentimentdl_model") \
#   .setInputCols(["sentence_embeddings"])\
#   .setOutputCol("class")

# pipeline = Pipeline(
#     stages = [
#         document,
#         use,
#         classifierdl
#     ])


# Evaluation 

Let's evaluate the model we trained earlier on a test dataset that it has never seen. The cell below uses the fitted `pipelineModel` directly rather than the reloaded pipeline:

In [35]:
testDataset = df_spark_test
preds = pipelineModel.transform(testDataset)

In [36]:
preds.select('True_Sentiment','text',"class.result").show(50, truncate=50)

+--------------+--------------------------------------------------+----------+
|True_Sentiment|                                              text|    result|
+--------------+--------------------------------------------------+----------+
|      negative|Redistricting causing confusion for voters in G...|[positive]|
|      negative|Midday Report: Whole Earth Brands (FREE) Trades...|[negative]|
|      positive|Report: Transfer Boost For Liverpool And Boruss...|[positive]|
|       neutral|Tennessee governor declines to intervene as exe...|[negative]|
|       neutral|Serena Williams, Lewis Hamilton Join U.K. Bid T...|[positive]|
|       neutral|Lindy Blanchard touts financial independence as...| [neutral]|
|       neutral|Pakistan’s new Cabinet sworn in at presidency i...| [neutral]|
|       neutral|Man United hires Erik ten Hag as 5th manager in...|[positive]|
|      positive|Alberta Investment Management Corporation Annou...|[positive]|
|      positive|Local favorites Savannah Seafood Sha

In [37]:
preds_df = preds.select('True_Sentiment','text',"class.result").toPandas()

In [38]:
# The result is an array because Spark NLP can return one prediction per sentence.
# For example, adding a SentenceDetector before the sentence-embedding stage
# would give one prediction for each sentence.
# Here each row holds a single prediction, so take the first item of the result column.
preds_df['result'] = preds_df['result'].apply(lambda x : x[0])
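
As a small illustration of the step above, here is a sketch on hypothetical data (`preds_toy` and `multi` are made-up frames, not output of this notebook). It shows taking the first label when every row has exactly one prediction, and `explode` as an option if a row held one prediction per sentence.

```python
import pandas as pd

# Hypothetical predictions: each result is a list with exactly one label.
preds_toy = pd.DataFrame({
    "True_Sentiment": ["negative", "neutral"],
    "result": [["positive"], ["neutral"]],
})
preds_toy["result"] = preds_toy["result"].apply(lambda labels: labels[0])

# Hypothetical multi-sentence case: one row, two sentence-level predictions.
multi = pd.DataFrame({
    "text": ["First sentence. Second sentence."],
    "result": [["positive", "negative"]],
})
# explode() turns each list item into its own row and repeats the other columns.
per_sentence = multi.explode("result").reset_index(drop=True)

print(preds_toy["result"].tolist())
print(per_sentence["result"].tolist())
```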

In [39]:
# We are going to use sklearn to evaluate the results on the test dataset
from sklearn.metrics import classification_report

Let's use `classification_report` from `sklearn` to evaluate the final scores. Keep in mind that, due to limited resources on a free Google Colab, we only trained for 10 epochs on 160 examples.

In [40]:
print(classification_report(preds_df['True_Sentiment'], preds_df['result']))

              precision    recall  f1-score   support

    negative       0.80      0.33      0.47        12
     neutral       0.73      0.57      0.64        14
    positive       0.42      0.89      0.57         9

    accuracy                           0.57        35
   macro avg       0.65      0.60      0.56        35
weighted avg       0.67      0.57      0.56        35
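
To make the last two rows of the report easier to read, the sketch below recomputes them from the per-class rows. The numbers are copied from the report above; the helper names `macro_avg` and `weighted_avg` are introduced here for illustration. Because the inputs are already rounded to two decimals, a recomputation like this could in general differ from sklearn's own output in the last digit.

```python
# Per-class rows copied from the report above: (precision, recall, f1, support).
report = {
    "negative": (0.80, 0.33, 0.47, 12),
    "neutral": (0.73, 0.57, 0.64, 14),
    "positive": (0.42, 0.89, 0.57, 9),
}

total_support = sum(row[3] for row in report.values())


def macro_avg(metric_index):
    """Unweighted mean of one metric across the classes."""
    return sum(row[metric_index] for row in report.values()) / len(report)


def weighted_avg(metric_index):
    """Mean of one metric weighted by each class's support."""
    return sum(row[metric_index] * row[3] for row in report.values()) / total_support


# Index 0 = precision, 1 = recall, 2 = f1-score.
macro = [round(macro_avg(i), 2) for i in range(3)]
weighted = [round(weighted_avg(i), 2) for i in range(3)]

print("macro avg   ", macro)
print("weighted avg", weighted)
```

The macro average treats the three classes equally, while the weighted average gives the 14 neutral test rows more influence than the 9 positive ones. With a test set this small, a handful of examples can move every score noticeably.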

