<a href="https://colab.research.google.com/github/harnalashok/deeplearning/blob/main/John_Labs_sentiment_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>



![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)





# **Sentiment Prediction in English text**
For details of the annotator, refer to the documentation [here](https://nlp.johnsnowlabs.com/docs/en/annotators#classifierdl). <br>
To find suitable spark-nlp models and pipelines, search [here](https://nlp.johnsnowlabs.com/models).

`ClassifierDLApproach` trains a `ClassifierDL` model for generic multi-class text classification.

`ClassifierDL` uses the state-of-the-art `Universal Sentence Encoder` as input for text classification. The `ClassifierDL` annotator uses a deep neural network (DNN) built in TensorFlow and supports up to 100 classes.

For instantiated/pretrained models, see ClassifierDLModel.

Note: This annotator accepts a label column holding a single item of type String, Int, Float, or Double. The input column can be produced by `UniversalSentenceEncoder`, `BertSentenceEmbeddings`, or `SentenceEmbeddings`.

## 1. Colab Setup
Install spark-nlp

In [None]:
# 1.0 Installs pyspark, spark-nlp and findspark

!wget http://setup.johnsnowlabs.com/colab.sh -O - | bash

# !bash colab.sh
# -p is for pyspark
# -s is for spark-nlp
# !bash colab.sh -p 3.1.1 -s 3.0.1
# by default they are set to the latest



In [None]:
#1.1  To check contents of colab.sh, just download 
#     colab.sh and examine its contents:

!wget http://setup.johnsnowlabs.com/colab.sh 

! cat /content/colab.sh


In [4]:
# Optional: force a reinstall of spark-nlp if the setup script above did not install it
# !pip install --ignore-installed spark-nlp

## 2. Start the Spark session
Import dependencies and start the Spark session.

In [12]:
# 2.0 Call libraries
import pandas as pd
import numpy as np
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.types import DoubleType, StringType, StructField, StructType
# Replace part of a string with another string
from pyspark.sql.functions import regexp_replace

In [6]:
# 2.1 Create Spark session
#     And start sparknlp

spark = sparknlp.start()

In [7]:
# 2.2 Show multiple command outputs from a cell

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"


## Mount gdrive

In [15]:
# 2.3 Mount gdrive to read data
from google.colab import drive
drive.mount('/gdrive')

Mounted at /gdrive


## 3. Read data

In [24]:
# 3.0
path = "/gdrive/MyDrive/Colab_data_files/corona_nlp/Corona_NLP_test.csv"

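For training, only two columns are needed: *text* and *label*. The *label* column may also be a string. This csv file has more columns, so the two that are needed are selected and given these names further below.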


### Read using pandas

In [None]:
# 3.1 Read dataset using pandas
f = pd.read_csv(path)
f.head()
f.shape
f['Sentiment'].value_counts()


In [38]:
# 3.2 Transform pandas to spark dataframe
g = f[['OriginalTweet','Sentiment']].copy()
cor_data =spark.createDataFrame(g)

In [None]:
# 3.3 Examine it
cor_data.show()
cor_data.count()      # 3798

### Read using pyspark

In [40]:
# 4.0 Read data directly in spark in the usual manner:

data = spark.read.csv(
                      path = path,
                      inferSchema=True,
                      header=True
                      )

In [None]:
# 4.1 The row count differs from the pandas read (3798), so something is wrong
data.show()
data.count()    # 6792

The mismatch between the pandas read (3798 rows) and the pyspark read (6792 rows) necessitates examining the data closely. On examining the data, one finds many records like the one below:

```
7,44959,,03-03-2020,Voting in the age of #coronavirus = hand sanitizer ? #SuperTuesday https://t.co/z0BeL4O6Dk,Positive
8,44960,"Geneva, Switzerland",03-03-2020,"@DrTedros ""We cant stop #COVID19 without protecting #healthworkers.

```

The above is an extract from a single tweet. The tweet spans multiple lines and also contains <u>doubled</u> double quotes (`""`) inside a quoted field. This complicates reading in pyspark. A solution is presented on StackOverflow at [this link](https://stackoverflow.com/a/69126284). We, therefore, proceed as follows:

In [43]:
# 4.2 First define a Spark schema
schema = StructType([
    StructField("UserName", StringType(), True),
    StructField("ScreenName", StringType(), True),
    StructField("Location", StringType(), True),
    StructField("TweetAt", StringType(), True),
    StructField("OriginalTweet", StringType(), True),
    StructField("Sentiment", StringType(), True)
])
 

In [47]:
# 4.3 Next read the data
#     The schema is passed with .schema(); .option("schema", ...) is not a csv reader option
df = spark.read  \
                 .option("quote", "\"") \
                 .option("escape", "\"") \
                 .option("multiLine", "true")  \
                 .schema(schema)  \
                 .option("header", "true") \
                 .csv(path)

In [None]:
# 4.4 The result shows complete match with pandas:
df.show()
df.count()    # 3798
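
To see why quote handling changes the row count, here is a minimal sketch that uses only Python's standard `csv` module, not Spark. The two-record sample is hypothetical: it is modelled on the extract above, and the continuation line and the second label are made up for illustration. It contrasts counting physical lines with counting quote-aware records.

```python
import csv
import io

# Hypothetical sample modelled on the extract above. The second tweet spans
# two physical lines and contains doubled double quotes inside a quoted field.
raw = (
    'UserName,ScreenName,Location,TweetAt,OriginalTweet,Sentiment\n'
    '7,44959,,03-03-2020,Voting in the age of #coronavirus,Positive\n'
    '8,44960,"Geneva, Switzerland",03-03-2020,"@DrTedros ""We cant stop #COVID19\n'
    'without protecting #healthworkers.""",Positive\n'
)

# Naive approach: treat every physical line after the header as one record.
naive_rows = raw.strip().split("\n")[1:]

# Quote-aware approach: a quoted field may hold commas, newlines and "" pairs.
reader = csv.reader(io.StringIO(raw, newline=""), quotechar='"', doublequote=True)
header = next(reader)
records = list(reader)

print("physical lines:", len(naive_rows))
print("logical records:", len(records))
print(repr(records[1][4]))  # the full tweet, including its embedded newline
```

The naive count is larger than the record count because the second tweet is split at its embedded newline, which is the same kind of inflation seen in the plain `spark.read.csv` call.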

## Prepare columns and labels

In [50]:
# 4.5
df= df.select('OriginalTweet', 'Sentiment')

In [None]:
# Add columns with the names used by the spark-nlp pipeline below:
# 'label' is a copy of 'Sentiment' and 'text' is a copy of 'OriginalTweet'
df = df.withColumn("label", df["Sentiment"])
df = df.withColumn("text", df["OriginalTweet"])

df.show()

In [None]:
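# Merge the 'Extremely ...' classes into their base classes.
# Without a subset argument, replace() applies to every column
# whose whole value matches.
df = df.replace('Extremely Negative', 'Negative')
df = df.replace('Extremely Positive', 'Positive')
df.show()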
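
The effect of the two `replace()` calls can be sketched in plain Python as a lookup table. This is only an illustration of the mapping, not Spark code; the label list below is hypothetical and the names `COLLAPSE` and `collapse_label` are introduced here for the example.

```python
from collections import Counter

# The mapping applied by the two replace() calls above.
COLLAPSE = {
    "Extremely Negative": "Negative",
    "Extremely Positive": "Positive",
}


def collapse_label(label: str) -> str:
    """Return the coarse label; labels without a mapping are left unchanged."""
    return COLLAPSE.get(label, label)


# Hypothetical labels, for illustration only.
labels = [
    "Extremely Negative",
    "Negative",
    "Neutral",
    "Positive",
    "Extremely Positive",
]
collapsed = [collapse_label(x) for x in labels]

# Class counts after merging.
print(Counter(collapsed))
```

In pyspark, `DataFrame.replace` also takes a `subset` argument, so a call such as `df.replace('Extremely Negative', 'Negative', subset=['label'])` would restrict the change to the *label* column and leave *Sentiment* untouched, if that is preferred.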

In [69]:
# Split data
train,test = df.randomSplit([0.8, 0.2])

In [70]:
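# Counts change from run to run because randomSplit is called without a seed
train.count()   # roughly 80% of the 3798 rows
print("\n")
test.count()    # roughly 20% of the 3798 rows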

3037





761
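
The split sizes are only approximately 80/20 and differ between runs. A minimal pure-Python sketch of why, under the assumption that each row is assigned to a side by an independent random draw (an illustration of the idea, not a description of Spark's internals; `bernoulli_split` is a hypothetical helper):

```python
import random


def bernoulli_split(rows, train_fraction=0.8, seed=None):
    """Assign each row to train or test with an independent random draw."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        if rng.random() < train_fraction:
            train.append(row)
        else:
            test.append(row)
    return train, test


rows = list(range(3798))  # same row count as the dataset above

# With a fixed seed the split is repeatable.
train_a, test_a = bernoulli_split(rows, seed=42)
train_b, test_b = bernoulli_split(rows, seed=42)

print(len(train_a), len(test_a))   # close to, but not exactly, 80/20
print(train_a == train_b)          # same seed, same split
```

In pyspark, `randomSplit` accepts a seed as well, for example `df.randomSplit([0.8, 0.2], seed=42)`, which should make the counts easier to reproduce across runs.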

## Process as spark-nlp

In [58]:
# Create DocumentAssembler Object:

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

In [59]:
# Embed each document with the Universal Sentence Encoder:
# The pretrained model is a download of around 924 MB:

useEmbeddings = UniversalSentenceEncoder.pretrained() \
    .setInputCols("document") \
    .setOutputCol("sentence_embeddings")

tfhub_use download started this may take some time.
Approximate size to download 923.7 MB
[OK!]


In [60]:
# Define classifier object:

docClassifier = ClassifierDLApproach() \
    .setInputCols("sentence_embeddings") \
    .setOutputCol("category") \
    .setLabelColumn("label") \
    .setBatchSize(64) \
    .setMaxEpochs(20) \
    .setLr(5e-3) \
    .setDropout(0.5)

In [61]:
# Create pipeline:

pipeline = Pipeline() \
    .setStages(
      [
        documentAssembler,
        useEmbeddings,
        docClassifier
      ]
    )

In [71]:
# Create model:
pipelineModel = pipeline.fit(train)

In [72]:
# Make predictions for test data:

pred = pipelineModel.transform(test)

In [64]:
pred.columns

['OriginalTweet',
 'Sentiment',
 'label',
 'text',
 'document',
 'sentence_embeddings',
 'category']

In [None]:
pred.select('label','category').show(truncate = False)

In [81]:
# Ref: https://nlp.johnsnowlabs.com/docs/en/annotators#finisher
finisher = Finisher().setInputCols("category").setOutputCols("output")

In [82]:
result = finisher.transform(pred)
result.columns

['OriginalTweet', 'Sentiment', 'label', 'text', 'output']

In [88]:
p = result.select('label', 'output').toPandas()

In [91]:
p = p.explode('output')

In [93]:
# Accuracy: fraction of test rows where the predicted class equals the label

np.sum((p.label == p.output))/p.shape[0]

0.6202365308804205
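
A single accuracy figure hides which classes are confused with each other. The following sketch, in plain Python with hypothetical labels and predictions, computes accuracy, per-class recall and confusion counts. The function names are introduced here for illustration and are not part of spark-nlp.

```python
from collections import Counter


def accuracy(labels, predictions):
    """Fraction of positions where the prediction equals the label."""
    pairs = list(zip(labels, predictions))
    return sum(t == p for t, p in pairs) / len(pairs)


def per_class_recall(labels, predictions):
    """For each true class, the fraction of its rows predicted correctly."""
    totals = Counter(labels)
    hits = Counter(t for t, p in zip(labels, predictions) if t == p)
    return {cls: hits[cls] / n for cls, n in totals.items()}


def confusion_counts(labels, predictions):
    """Count each (true label, predicted label) pair."""
    return Counter(zip(labels, predictions))


# Hypothetical data, for illustration only.
y_true = ["Positive", "Positive", "Negative", "Neutral", "Negative", "Neutral"]
y_pred = ["Positive", "Negative", "Negative", "Neutral", "Negative", "Positive"]

print(accuracy(y_true, y_pred))
print(per_class_recall(y_true, y_pred))
print(confusion_counts(y_true, y_pred))
```

With the pandas frame built above, the same functions could be called as `per_class_recall(p.label.tolist(), p.output.tolist())`.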