<a href="https://colab.research.google.com/github/Brand-Sentiment-Tracking/dev-sentiment-package/blob/main/johnsnow/Train_sentiment_classifier_with_our_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

# Spark NLP
### Multi-class Sentiment Classification
#### By using SentimentDL

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/training/english/classification/SentimentDL_train_multiclass_sentiment_classifier.ipynb)

Run this block only if you are inside Google Colab. Otherwise skip the setup script, but keep the `classification_report` import, which later cells need.

In [55]:
# This is only to setup PySpark and Spark NLP on Colab
!wget http://setup.johnsnowlabs.com/colab.sh -O - | bash
from sklearn.metrics import classification_report


In [56]:
import sparknlp

spark = sparknlp.start(gpu=True)

print("Spark NLP version: ", sparknlp.version())
print("Apache Spark version", spark.version)

Spark NLP version:  3.4.3
Apache Spark version 3.0.3


### Load the labelled dataset

In [57]:
# Convert Kaggle data to Pandas dataframe and preprocess
import pandas as pd
from pyspark.sql.functions import rand 
from sklearn.model_selection import train_test_split

sentiment_url_1 = 'https://raw.githubusercontent.com/Brand-Sentiment-Tracking/python-package/main/data/sentiment_test_data.csv' # Financial Phrase Bank
sentiment_url_2 = 'https://raw.githubusercontent.com/Brand-Sentiment-Tracking/python-package/main/data/SEN_en_R.csv' # SEN data
sentiment_url_3 = 'https://raw.githubusercontent.com/Brand-Sentiment-Tracking/dev-sentiment-package/main/data/labelled_1.csv' # our extracted data



# # Create a preprocessed spark dataframe
# from pyspark import SparkFiles
# spark.sparkContext.addFile(sentiment_url_3)

# # Read financial dataframe
# df_spark = spark.read.option("header","false").csv("file://"+SparkFiles.get("sentiment_test_data.csv"))
# df_spark = df_spark.withColumnRenamed("_c0", "label").withColumnRenamed("_c1", "text")
# # Shuffle dataset
# df_spark = df_spark.orderBy(rand(2))

# # Read political dataframe
# df_spark = spark.read.option("header","true").csv("file://"+SparkFiles.get("SEN_en_R.csv"))
# df_spark = df_spark.withColumnRenamed("_c0", "idx").withColumnRenamed("headline", "text") \
#                                                  .withColumnRenamed("majority_label", 'label')
# # Replace abbreviations with full labels and filter out the unk datapoints (applies only to the political dataset)
# df_spark = df_spark.replace("neutr", "neutral").replace("pos", "positive").replace("neg","negative")
# df_spark = df_spark[df_spark.label != "unk"]

# # Read our own dataframe

cols_to_read = ['text',"sentiment (Max's take)"]
df_pandas = pd.read_csv(sentiment_url_3, usecols=cols_to_read)

# Rename sentiment to True_Sentiment
df_pandas.rename(columns={"sentiment (Max's take)":"True_Sentiment"},inplace=True)

# Make the dataset smaller for a faster runtime (keep the first num_sentences rows)
num_sentences = 500
df_pandas = df_pandas.head(num_sentences).copy()

# Replace 1, 2, 3 with negative, neutral, positive
# (assign the result back instead of calling replace(..., inplace=True) on a column,
# which newer pandas versions warn about)
df_pandas["True_Sentiment"] = df_pandas["True_Sentiment"].replace({1.0: "negative", 2.0: "neutral", 3.0: "positive"})


In [58]:
# Downsample majority class neutral
negative_points = df_pandas[df_pandas["True_Sentiment"] == 'negative'].head(65)
neutral_points = df_pandas[df_pandas["True_Sentiment"] == 'neutral'].head(115)
positive_points = df_pandas[df_pandas["True_Sentiment"] == 'positive'].head(50)

print(len(neutral_points))
print(len(positive_points))
print(len(negative_points))

# Split into train and test
negative_train, negative_test_and_val = train_test_split(negative_points, test_size = 0.3) # split negative points into 70% training, 30% test+val
negative_test, negative_val = train_test_split(negative_test_and_val, test_size = 0.5) # split in half: test = 15%, val = 15%

neutral_train, neutral_test_and_val = train_test_split(neutral_points, test_size = 0.3) # split neutral points into 70% training, 30% test+val
neutral_test, neutral_val = train_test_split(neutral_test_and_val, test_size = 0.5) # split in half: test = 15%, val = 15%

positive_train, positive_test_and_val = train_test_split(positive_points, test_size = 0.3) # split positive points into 70% training, 30% test+val
positive_test, positive_val = train_test_split(positive_test_and_val, test_size = 0.5) # split in half: test = 15%, val = 15%


# Concatenate into training testing and validation datasets and shuffle
df_pandas_train = pd.concat([negative_train, neutral_train, positive_train]).sample(frac=1, random_state = 0)
df_pandas_test = pd.concat([negative_test, neutral_test, positive_test]).sample(frac=1, random_state = 0)
df_pandas_val = pd.concat([negative_val, neutral_val, positive_val]).sample(frac=1, random_state = 0)


# # Convert to spark dataframe  
df_spark_train = spark.createDataFrame(df_pandas_train)
df_spark_test = spark.createDataFrame(df_pandas_test)
df_spark_val = spark.createDataFrame(df_pandas_val)

df_spark_train.show()

# Check the class balance of the training dataset (printed in the order neutral, negative, positive)
print(df_spark_train[df_spark_train.True_Sentiment == "neutral"].count())
print(df_spark_train[df_spark_train.True_Sentiment == "negative"].count())
print(df_spark_train[df_spark_train.True_Sentiment == "positive"].count())

115
50
65
+--------------------+--------------+
|                text|True_Sentiment|
+--------------------+--------------+
|Majority Of Ameri...|       neutral|
|Serena Williams, ...|       neutral|
|UN: Yemen’s Houth...|      positive|
|IPL 2022: It's Hi...|      negative|
|Bakery owner accu...|      negative|
|Russia pours in m...|       neutral|
|Remembering Bedfo...|       neutral|
|The Daily Ripple-...|       neutral|
|Sigma Planning Co...|       neutral|
|Sigma Planning Co...|       neutral|
|Bunzl (LON:BNZL) ...|      positive|
|Kentucky abortion...|       neutral|
|This Earth Day, B...|      negative|
|Access controls b...|      positive|
|CA WFO LOS ANGELE...|       neutral|
|WELL Health to Pr...|       neutral|
|Hong Kong Disneyl...|       neutral|
|Snapdragon 888 vs...|       neutral|
|Indian shares see...|      positive|
|AMREP Co. (NYSE:A...|       neutral|
+--------------------+--------------+
only showing top 20 rows

80
45
35
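The training counts printed above (80, 45, 35) follow from the per-class 70/15/15 split. The sketch below reproduces that arithmetic, assuming `train_test_split` rounds the held-out fraction up and gives the remainder to the first returned part. The helper name `split_sizes` is hypothetical and used only for illustration.

```python
import math

def split_sizes(n, holdout=0.3, val_share=0.5):
    """Return (train, test, val) sizes for n rows of one class.

    Assumption: the held-out part is ceil(fraction * n) and the
    remainder goes to the first returned part, as in the cell above.
    """
    n_holdout = math.ceil(holdout * n)        # test + val rows
    n_train = n - n_holdout                   # remaining rows for training
    n_val = math.ceil(val_share * n_holdout)  # second part of the second split
    n_test = n_holdout - n_val                # first part of the second split
    return n_train, n_test, n_val

for label, n in [("neutral", 115), ("negative", 65), ("positive", 50)]:
    print(label, split_sizes(n))
# neutral (80, 17, 18)
# negative (45, 10, 10)
# positive (35, 7, 8)
```

The test and validation sizes also match the `support` columns of the classification reports further down (34 test rows and 36 validation rows in total).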


In [59]:
# # Split into train and test
# df_spark_train = df_spark.limit(round(df_spark.count()*0.7)) # 70% train
# df_spark_test_and_val = df_spark.exceptAll(df_spark_train) # 30% test + val
# df_spark_test = df_spark_test_and_val.limit(round(df_spark_test_and_val.count()*0.5)) # Take half for testing  = 15%
# df_spark_val = df_spark_test_and_val.exceptAll(df_spark_test) # and half for validation = 15%

# # Show the training set
# print(f"Number of training points: {df_spark_train.count()}")
# print(f"Number of testing points: {df_spark_test.count()}")
# print(f"Number of validation points: {df_spark_val.count()}")
# df_spark_train.show()

In [60]:
from pyspark.ml import Pipeline

from sparknlp.annotator import *
from sparknlp.common import *
from sparknlp.base import *

In [101]:
# Alternative architecture

document = DocumentAssembler()\
      .setInputCol("text")\
      .setOutputCol("document")
    
bert_cmlm = BertSentenceEmbeddings.pretrained('sent_bert_use_cmlm_en_base', 'en')\
      .setInputCols(["document"])\
      .setOutputCol("sentence_embeddings")

classifierdl = ClassifierDLApproach()\
      .setInputCols(["sentence_embeddings"])\
      .setOutputCol("class")\
      .setLabelColumn("True_Sentiment")\
      .setMaxEpochs(9)\
      .setEnableOutputLogs(True)\
      .setLr(0.005)\
      .setBatchSize(10)

bert_cmlm_clf_pipeline = Pipeline(
    stages = [
        document,
        bert_cmlm,
        classifierdl
    ])

sent_bert_use_cmlm_en_base download started this may take some time.
Approximate size to download 391.6 MB
[OK!]


In [102]:
# Fit to the training dataset (train the model)
pipelineModel = bert_cmlm_clf_pipeline.fit(df_spark_train)

# How to use an already trained ClassifierDL pipeline or its model

There are two ways to reuse what we have trained: the whole pipeline or only the model.

Let's see how to save the entire pipeline, load it, and run predictions with the loaded pipeline.

## Save and load the trained ClassifierDL pipeline

In [103]:
# The free tier of Google Colab comes with limited memory, so the full pipeline
# cannot be saved and loaded in this notebook. You can do this locally or on a
# machine with more memory.

# from pyspark.ml import PipelineModel
# pipelineModel.save("./sentimentdl_pipeline")
# loadedPipeline = PipelineModel.load("./sentimentdl_pipeline")
# loadedPipeline.transform(YOUR_DATAFRAME)

## Save and load the trained ClassifierDL model

In [104]:
# hdfs:/ if you are saving it on distributed file systems in Hadoop
pipelineModel.stages[-1].write().overwrite().save('./tmp_sentimentdl_model')

Let's use our trained ClassifierDLModel in a pipeline:

In [105]:
load = False
if load:
    # Rebuild the inference pipeline around the saved classifier
    document = DocumentAssembler()\
          .setInputCol("text")\
          .setOutputCol("document")
        
    bert_cmlm = BertSentenceEmbeddings.pretrained('sent_bert_use_cmlm_en_base', 'en')\
          .setInputCols(["document"])\
          .setOutputCol("sentence_embeddings")

    classifierdl = ClassifierDLModel.load("./tmp_sentimentdl_model") \
      .setInputCols(["sentence_embeddings"])\
      .setOutputCol("class")

    # Note: this is an unfitted Pipeline; call .fit() on a DataFrame with a
    # "text" column to obtain a PipelineModel before calling .transform()
    bert_cmlm_clf_pipeline = Pipeline(
        stages = [
            document,
            bert_cmlm,
            classifierdl
        ])
    
else:
    # Reuse the pipeline model trained above
    bert_cmlm_clf_pipeline = pipelineModel

### Validate

In [106]:
preds_val = pipelineModel.transform(df_spark_val)
preds_df_val = preds_val.select('True_Sentiment','text',"class.result").toPandas()
preds_df_val['result'] = preds_df_val['result'].apply(lambda x : x[0])
print (classification_report(preds_df_val['True_Sentiment'], preds_df_val['result'] ))

              precision    recall  f1-score   support

    negative       0.62      0.80      0.70        10
     neutral       0.57      0.72      0.63        18
    positive       0.00      0.00      0.00         8

    accuracy                           0.58        36
   macro avg       0.39      0.51      0.44        36
weighted avg       0.45      0.58      0.51        36
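The `positive` row in the validation report above shows 0.00 for precision, recall, and F1. The sketch below shows how such a row can arise: it computes the per-class scores by hand and falls back to 0 whenever a denominator is zero. The function name `per_class_scores` and the toy labels are hypothetical; this is not the scikit-learn implementation.

```python
def per_class_scores(y_true, y_pred, zero_division=0.0):
    """Return {label: (precision, recall, f1, support)} from two label lists."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        # True positives: rows where both the truth and the prediction are `label`
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        predicted = sum(p == label for p in y_pred)  # how often the model predicted it
        actual = sum(t == label for t in y_true)     # support of the class
        precision = tp / predicted if predicted else zero_division
        recall = tp / actual if actual else zero_division
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else zero_division
        scores[label] = (precision, recall, f1, actual)
    return scores

# Hypothetical toy data: the model never predicts "positive"
y_true = ["negative", "neutral", "neutral", "positive", "positive"]
y_pred = ["negative", "neutral", "negative", "neutral", "neutral"]

scores = per_class_scores(y_true, y_pred)
for label, (p, r, f1, support) in scores.items():
    print(f"{label:>8}  precision={p:.2f}  recall={r:.2f}  f1={f1:.2f}  support={support}")
```

In the toy data the model never predicts `positive`, so precision has a zero denominator and recall has a zero numerator, which is one way to end up with an all-zero row.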





# Evaluation 

Let's evaluate the ClassifierDL model we trained earlier on a test dataset that the model has never seen, and compare it with a pretrained finance sentiment pipeline:

In [107]:
# Pretrained model
from sparknlp.pretrained import PretrainedPipeline
identifier_pretrained =  PretrainedPipeline("classifierdl_bertwiki_finance_sentiment_pipeline", lang='en')

classifierdl_bertwiki_finance_sentiment_pipeline download started this may take some time.
Approx size to download 412.9 MB
[OK!]


In [108]:
preds = pipelineModel.transform(df_spark_test)
preds_2 = identifier_pretrained.transform(df_spark_test)

In [109]:
preds.select('True_Sentiment','text',"class.result").show(50, truncate=50)
preds_2.select('True_Sentiment','text',"class.result").show(50, truncate=50)

+--------------+--------------------------------------------------+----------+
|True_Sentiment|                                              text|    result|
+--------------+--------------------------------------------------+----------+
|       neutral|China's Xi proposes 'global security initiative...| [neutral]|
|       neutral|New Jersey's recreational marijuana market open...| [neutral]|
|       neutral|       Huntington Bancshares: Q1 Earnings Snapshot| [neutral]|
|       neutral|               Remembering Beaver Area High School| [neutral]|
|      negative|Ford recalls over 650K trucks; windshield wiper...| [neutral]|
|       neutral|Man United hire Erik ten Hag as 5th manager in ...|[negative]|
|      positive|Veterans Minister congratulates RSA on 100 year...| [neutral]|
|       neutral|              Remembering Beaver Local High School| [neutral]|
|       neutral|     Welltower to Participate in Investor Meetings| [neutral]|
|      positive|RYU Apparel Brings Customer Care Ser

In [110]:
preds_df = preds.select('True_Sentiment','text',"class.result").toPandas()
preds_df_2 = preds_2.select('True_Sentiment','text',"class.result").toPandas()

In [111]:
# The result is an array because Spark NLP can return one prediction per sentence.
# If you add a SentenceDetector to the pipeline and feed its output into the
# sentence embeddings, you get a prediction for each sentence.
# Here each row is a single document, so take the first item of the result array.
preds_df['result'] = preds_df['result'].apply(lambda x : x[0])
preds_df_2['result'] = preds_df_2['result'].apply(lambda x : x[0])
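Indexing with `x[0]` raises an `IndexError` if a result array is empty. A slightly more defensive variant is sketched below; the helper name `first_label` and the sample rows are hypothetical.

```python
def first_label(result):
    """Return the first predicted label of a result array, or None if it is empty."""
    return result[0] if result else None

# Hypothetical rows shaped like the `class.result` column
rows = [["neutral"], ["negative"], []]
labels = [first_label(r) for r in rows]
print(labels)  # ['neutral', 'negative', None]
```

With pandas, the same helper could be passed to `apply` in place of the lambda.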

Let's use `classification_report` from `sklearn` to evaluate the final scores.

In [112]:
print (classification_report(preds_df['True_Sentiment'], preds_df['result'] ))

              precision    recall  f1-score   support

    negative       0.75      0.60      0.67        10
     neutral       0.64      0.94      0.76        17
    positive       1.00      0.14      0.25         7

    accuracy                           0.68        34
   macro avg       0.80      0.56      0.56        34
weighted avg       0.75      0.68      0.63        34



In [113]:
print (classification_report(preds_df_2['True_Sentiment'], preds_df_2['result'] ))

              precision    recall  f1-score   support

    negative       0.83      0.50      0.62        10
     neutral       0.70      0.94      0.80        17
    positive       0.60      0.43      0.50         7

    accuracy                           0.71        34
   macro avg       0.71      0.62      0.64        34
weighted avg       0.72      0.71      0.69        34
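The `macro avg` and `weighted avg` rows can be checked from the per-class rows: the macro average is the plain mean over classes, while the weighted average weights each class by its support. The sketch below recomputes both F1 averages from the rounded values printed in the two reports above. The helper name `macro_and_weighted` is hypothetical, and because the inputs are already rounded, agreement in the last digit is not guaranteed in general.

```python
def macro_and_weighted(per_class):
    """per_class: list of (score, support) pairs. Return (macro, weighted) averages."""
    total_support = sum(support for _, support in per_class)
    macro = sum(score for score, _ in per_class) / len(per_class)
    weighted = sum(score * support for score, support in per_class) / total_support
    return macro, weighted

# (f1-score, support) for negative, neutral, positive, copied from the reports above
trained_f1 = [(0.67, 10), (0.76, 17), (0.25, 7)]
pretrained_f1 = [(0.62, 10), (0.80, 17), (0.50, 7)]

for name, rows in [("trained", trained_f1), ("pretrained", pretrained_f1)]:
    macro, weighted = macro_and_weighted(rows)
    print(f"{name}: macro F1 = {macro:.2f}, weighted F1 = {weighted:.2f}")
# trained: macro F1 = 0.56, weighted F1 = 0.63
# pretrained: macro F1 = 0.64, weighted F1 = 0.69
```

The macro average is lower than the weighted average for both models because the smallest class, `positive`, has the lowest F1 and counts equally in the macro mean.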

