<a href="https://colab.research.google.com/github/Brand-Sentiment-Tracking/dev-sentiment-package/blob/main/johnsnow/Fit_sentiment_classifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

# Spark NLP
### Multi-class Sentiment Classification
#### By using ClassifierDL

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/training/english/classification/SentimentDL_train_multiclass_sentiment_classifier.ipynb)

Only run this cell if you are inside Google Colab; otherwise, skip it.

In [1]:
# This is only to set up PySpark and Spark NLP on Colab
!wget http://setup.johnsnowlabs.com/colab.sh -O - | bash

--2022-04-22 11:27:28--  http://setup.johnsnowlabs.com/colab.sh
Resolving setup.johnsnowlabs.com (setup.johnsnowlabs.com)... 51.158.130.125
Connecting to setup.johnsnowlabs.com (setup.johnsnowlabs.com)|51.158.130.125|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://setup.johnsnowlabs.com/colab.sh [following]
--2022-04-22 11:27:29--  https://setup.johnsnowlabs.com/colab.sh
Connecting to setup.johnsnowlabs.com (setup.johnsnowlabs.com)|51.158.130.125|:443... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/scripts/colab_setup.sh [following]
--2022-04-22 11:27:29--  https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/scripts/colab_setup.sh
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.108.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:44

In [2]:
import sparknlp

spark = sparknlp.start()

print("Spark NLP version: ", sparknlp.version())
print("Apache Spark version", spark.version)


Spark NLP version:  3.4.3
Apache Spark version 3.0.3


### Load political dataset

In [60]:
# Load the sentiment CSV into a Spark DataFrame and preprocess it
import pandas as pd
from pyspark.sql.functions import rand 

sentiment_url_1 = 'https://raw.githubusercontent.com/Brand-Sentiment-Tracking/python-package/main/data/sentiment_test_data.csv' # Financial Phrase Bank
sentiment_url_2 = 'https://raw.githubusercontent.com/Brand-Sentiment-Tracking/python-package/main/data/SEN_en_R.csv' # SEN data

# Create a preprocessed spark dataframe
from pyspark import SparkFiles
spark.sparkContext.addFile(sentiment_url_2)

# # Read financial dataframe
# df_spark = spark.read.option("header","false").csv("file://"+SparkFiles.get("sentiment_test_data.csv"))
# df_spark = df_spark.withColumnRenamed("_c0", "label").withColumnRenamed("_c1", "text")
# # Shuffle dataset
# df_spark = df_spark.orderBy(rand(2))

# Read political dataframe
df_spark = spark.read.option("header","true").csv("file://"+SparkFiles.get("SEN_en_R.csv"))
df_spark = df_spark.withColumnRenamed("_c0", "idx").withColumnRenamed("headline", "text") \
                                                 .withColumnRenamed("majority_label", 'label')




# Replace abbreviations with full labels and filter out the unk datapoints (applies only to the political dataset)
# Restricting the replacement to the label column leaves text and entity values untouched
df_spark = df_spark.replace({"neutr": "neutral", "pos": "positive", "neg": "negative"}, subset=["label"])
# print(df_spark.count())
df_spark = df_spark[df_spark.label != "unk"]
# print(df_spark.count())

# Split into train and test
df_spark_train = df_spark.limit(round(df_spark.count()*0.8))
df_spark_test = df_spark.subtract(df_spark_train)

df_spark_train.show()

+---+--------------------+------+--------+
|idx|                text|entity|   label|
+---+--------------------+------+--------+
|  0|Boris Johnson Joi...| Trump| neutral|
|  1|Trump Praises Con...| Trump|negative|
|  2|Stung by Trump’s ...| Trump| neutral|
|  3|What ‘America Fir...| Trump| neutral|
|  4|Trump Calls for N...| Trump|positive|
|  5|Trump Set to Meet...| Trump| neutral|
|  6|Trump Clashes Wit...| Trump| neutral|
|  7|Donald Trump Won’...| Trump|positive|
|  8|Trump to Attend N...| Trump| neutral|
|  9|Trump-Erdogan Rap...| Trump| neutral|
| 10|Trump Says Erdoga...| Trump| neutral|
| 11|Ukrainian Preside...| Trump| neutral|
| 12|In Months Before ...| Trump|negative|
| 13|Ukraine to Review...| Biden| neutral|
| 14|For Trump, Long O...| Trump| neutral|
| 15|Taliban Negotiato...| Trump|negative|
| 16|Trump Administrat...| Trump|negative|
| 17|U.S. Sanctions Ti...| Putin| neutral|
| 18|New York State Su...| Trump| neutral|
| 19|Trump’s NASA Budg...| Trump|positive|
+---+------

The content is in the `text` column and the sentiment is in the `label` column.
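The 80/20 split above takes the first 80% of rows with `limit` and leaves the rest for testing, so both sets follow whatever order the CSV happens to have. Below is a minimal sketch of a seeded shuffle-then-cut split, written in plain Python over row indices. The helper name `seeded_split` and the seed value are hypothetical and not part of the original notebook; the row count 1237 is simply 990 + 247, the train and test sizes reported in this notebook. In Spark, `DataFrame.randomSplit([0.8, 0.2], seed=...)` serves a similar purpose, although its split sizes are only approximate.

```python
import random


def seeded_split(n_rows, train_fraction=0.8, seed=42):
    """Return (train_indices, test_indices) after a reproducible shuffle."""
    indices = list(range(n_rows))
    # A local Random instance keeps the shuffle reproducible for a given seed
    # without touching the global random state.
    random.Random(seed).shuffle(indices)
    cut = round(n_rows * train_fraction)
    return indices[:cut], indices[cut:]


train_idx, test_idx = seeded_split(1237)
print(len(train_idx), len(test_idx))
```

Because the two index lists are slices of one shuffled list, they are disjoint and together cover every row exactly once.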

In [61]:
trainDataset = df_spark_train
trainDataset.show()

+---+--------------------+------+--------+
|idx|                text|entity|   label|
+---+--------------------+------+--------+
|  0|Boris Johnson Joi...| Trump| neutral|
|  1|Trump Praises Con...| Trump|negative|
|  2|Stung by Trump’s ...| Trump| neutral|
|  3|What ‘America Fir...| Trump| neutral|
|  4|Trump Calls for N...| Trump|positive|
|  5|Trump Set to Meet...| Trump| neutral|
|  6|Trump Clashes Wit...| Trump| neutral|
|  7|Donald Trump Won’...| Trump|positive|
|  8|Trump to Attend N...| Trump| neutral|
|  9|Trump-Erdogan Rap...| Trump| neutral|
| 10|Trump Says Erdoga...| Trump| neutral|
| 11|Ukrainian Preside...| Trump| neutral|
| 12|In Months Before ...| Trump|negative|
| 13|Ukraine to Review...| Biden| neutral|
| 14|For Trump, Long O...| Trump| neutral|
| 15|Taliban Negotiato...| Trump|negative|
| 16|Trump Administrat...| Trump|negative|
| 17|U.S. Sanctions Ti...| Putin| neutral|
| 18|New York State Su...| Trump| neutral|
| 19|Trump’s NASA Budg...| Trump|positive|
+---+------

In [62]:
trainDataset.count()

990

In [45]:
from pyspark.ml import Pipeline

from sparknlp.annotator import *
from sparknlp.common import *
from sparknlp.base import *

In [116]:
# The headline text is in the `text` column
document = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

# Encode each document into a sentence embedding
use = UniversalSentenceEncoder.pretrained() \
 .setInputCols(["document"])\
 .setOutputCol("sentence_embeddings")


# Trainable classifier that learns the `label` column from the embeddings
classifierdl = ClassifierDLApproach()\
  .setInputCols(["sentence_embeddings"])\
  .setOutputCol("class")\
  .setLabelColumn("label")\
  .setMaxEpochs(10)\
  .setEnableOutputLogs(True)

pipeline = Pipeline(
    stages = [
        document,
        use,
        classifierdl
    ])

tfhub_use download started this may take some time.
Approximate size to download 923.7 MB
[OK!]


In [117]:
# Fit to the training dataset (train the model)
pipelineModel = pipeline.fit(trainDataset)

# How to use the trained ClassifierDL pipeline or its model

There are two ways to reuse what we trained: the whole pipeline or just the model.

Let's see how to save the entire pipeline, load it, and make predictions with the loaded pipeline.

## Save and load the trained ClassifierDL pipeline

In [118]:
# The free Google Colab tier comes with limited memory, so saving and loading
# the whole pipeline is not practical in this notebook. You can do this locally
# or on a machine with more memory.

# from pyspark.ml import PipelineModel
# pipelineModel.save("./sentimentdl_pipeline")
# loadedPipeline = PipelineModel.load("./sentimentdl_pipeline")
# loadedPipeline.transform(YOUR_DATAFRAME)

## Save and load the trained ClassifierDL model

In [119]:
# Use an hdfs:/ path if you are saving to a Hadoop distributed file system
pipelineModel.stages[-1].write().overwrite().save('./tmp_sentimentdl_model')



Let's use our saved ClassifierDLModel in a pipeline:

In [120]:

# In a new pipeline you can load it for prediction
document = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

use = UniversalSentenceEncoder.pretrained() \
 .setInputCols(["document"])\
 .setOutputCol("sentence_embeddings")

classifierdl = ClassifierDLModel.load("./tmp_sentimentdl_model") \
  .setInputCols(["sentence_embeddings"])\
  .setOutputCol("class")

pipeline = Pipeline(
    stages = [
        document,
        use,
        classifierdl
    ])


tfhub_use download started this may take some time.
Approximate size to download 923.7 MB
[OK!]


Now we can run predictions through the whole pipeline, with the loaded model as its last stage:

In [121]:
# from pyspark.sql.types import StringType

# dfTest = spark.createDataFrame([
#     "This movie is a delight for those of all ages. I have seen it several times and each time I am enchanted by the characters and magic. The cast is outstanding, the special effects delightful, everything most believable.",
#     "This film was to put it simply rubbish. The child actors couldn't act, as can be seen by Harry's supposed surprise on learning he's a wizard. I'm a wizard! is said with such indifference you'd think he's not surprised at all."
# ], StringType()).toDF("text")

In [122]:
# dfTest = trainDataset
# prediction = pipeline.fit(dfTest).transform(dfTest)

In [123]:
# prediction.select("class.result").show()

# prediction.select("class.metadata").show(truncate=False)
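The `class.metadata` column is where the classifier's per-label scores end up. As a rough sketch, assuming each metadata entry is a map from label name to a stringified score plus a `sentence` index key, the most confident label can be picked out as shown below. The sample dictionary is made up for illustration and is not output from this notebook, and `top_label` is a hypothetical helper.

```python
def top_label(metadata, ignore=("sentence",)):
    """Return (label, score) for the highest-scoring label in one metadata map."""
    # Scores are assumed to arrive as strings, so convert them before comparing.
    scores = {key: float(value) for key, value in metadata.items() if key not in ignore}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Illustrative values only, not output from the notebook.
example = {"sentence": "0", "negative": "0.21", "neutral": "0.72", "positive": "0.07"}
print(top_label(example))
```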

# Evaluation

Let's evaluate the fitted `pipelineModel` on a test dataset that the model has never seen:

In [124]:
testDataset = df_spark_test
preds = pipelineModel.transform(testDataset)

In [125]:
preds.select('label','text',"class.result").show(50, truncate=50)

+--------+--------------------------------------------------+----------+
|   label|                                              text|    result|
+--------+--------------------------------------------------+----------+
|negative|Pelosi moves to end Trump impeachment standoff,...| [neutral]|
| neutral|            Iran gives Trump an off-ramp — for now| [neutral]|
| neutral|House passes $1.4 trillion spending bill with T...| [neutral]|
| neutral|At Las Vegas events, Joe Biden and Pete Buttigi...| [neutral]|
| neutral|Biden says he would testify in Senate impeachme...| [neutral]|
| neutral|Mike Bloomberg pitches himself to Jewish Americ...| [neutral]|
| neutral|Democrats scale back language as Trump and GOP ...| [neutral]|
|positive|Trump praises Pompeo for echoing his behavior: ...|[negative]|
| neutral|CNN Don Lemon segment mocking Trump supporters ...|[negative]|
|negative|  Trump has a rare public moment of self-awareness|[negative]|
| neutral|         The Trailer: Who's voting for To

In [126]:
preds_df = preds.select('label','text',"class.result").toPandas()

In [127]:
# The result is an array since in Spark NLP you can have multiple sentences.
# This means you can add SentenceDetector in the pipeline and feed it into
# UniversalSentenceEncoder and you can have prediction based on each sentence.
# Here each array holds a single prediction, so take its first item
preds_df['result'] = preds_df['result'].apply(lambda x : x[0])
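Indexing with `x[0]` assumes every row produced at least one prediction. A slightly more defensive sketch, assuming an empty or missing `result` array should become `None`, could be passed to `.apply` in place of the lambda. The helper name `first_or_none` is hypothetical.

```python
def first_or_none(values):
    """Return the first prediction in a result array, or None if there is none."""
    # len() works for both Python lists and NumPy arrays, unlike a bare truth test.
    if values is not None and len(values) > 0:
        return values[0]
    return None


rows = [["neutral"], ["negative"], []]  # toy stand-in for the `result` column
print([first_or_none(row) for row in rows])
```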

In [128]:
# We are going to use sklearn to evaluate the results on the test dataset
from sklearn.metrics import classification_report

Let's use `classification_report` from `sklearn` to evaluate the final scores. (Keep in mind that, due to limited resources on a free Google Colab, we only trained for the 10 epochs set with `setMaxEpochs(10)`.)

In [129]:
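# classification_report expects (y_true, y_pred): true labels first, predictions second.
# The report recorded below came from a run with the two arguments reversed, so its
# precision and recall columns are exchanged and `support` counts predictions.
print(classification_report(preds_df['label'], preds_df['result']))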

              precision    recall  f1-score   support

    negative       0.52      0.52      0.52        98
     neutral       0.75      0.56      0.64       149
    positive       0.00      0.00      0.00         0

    accuracy                           0.54       247
   macro avg       0.42      0.36      0.39       247
weighted avg       0.66      0.54      0.59       247
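`classification_report` expects the true labels as its first argument and the predictions as its second. The sketch below, in plain Python with made-up toy labels, illustrates why the order matters: swapping the two arguments exchanges precision and recall for each class, and `support` then counts predictions rather than true labels. The helper `precision_recall` is hypothetical and not part of sklearn.

```python
def precision_recall(y_true, y_pred, label):
    """Per-class precision, recall, and support (number of true instances)."""
    true_positives = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    predicted = sum(p == label for p in y_pred)
    actual = sum(t == label for t in y_true)
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    return precision, recall, actual


# Toy data in which the model never predicts "positive".
y_true = ["positive", "neutral", "neutral", "negative", "positive", "neutral", "neutral"]
y_pred = ["neutral", "neutral", "neutral", "negative", "negative", "neutral", "neutral"]

print(precision_recall(y_true, y_pred, "neutral"))   # arguments in the expected order
print(precision_recall(y_pred, y_true, "neutral"))   # arguments swapped
print(precision_recall(y_pred, y_true, "positive"))  # swapped: support becomes 0
```

A `positive` row with a support of 0, as in the report above, is consistent with this swapped ordering combined with a model that never predicts that class.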



