<a href="https://colab.research.google.com/github/Bang2018/SPARK-ML/blob/main/Spark_NLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Introduction**

This tutorial covers the following:

1. Installation of Spark NLP
2. A classifier for the sarcastic Reddit comments dataset
3. Text summarization using the T5 Transformer

The dataset contains 1.3 million sarcastic comments from the Internet commentary website Reddit. The data was gathered by Mikhail Khodak, Nikunj Saunshi, and Kiran Vodrahalli for their article "A Large Self-Annotated Corpus for Sarcasm". The balanced training file loaded below has 1,010,826 rows, and the classifier in this notebook is trained and evaluated on a 10,000-row sample of it.

**Contribution**: My classifier reached 64.42% accuracy on the held-out test split. John Snow Labs got 60% accuracy.


## Installation of Spark NLP

In [None]:
!wget http://setup.johnsnowlabs.com/colab.sh -O - | bash

--2021-09-02 18:32:09--  http://setup.johnsnowlabs.com/colab.sh
Resolving setup.johnsnowlabs.com (setup.johnsnowlabs.com)... 51.158.130.125
Connecting to setup.johnsnowlabs.com (setup.johnsnowlabs.com)|51.158.130.125|:80... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/scripts/colab_setup.sh [following]
--2021-09-02 18:32:09--  https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/scripts/colab_setup.sh
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1608 (1.6K) [text/plain]
Saving to: ‘STDOUT’


2021-09-02 18:32:10 (34.3 MB/s) - written to stdout [1608/1608]

setup Colab for PySpark 3.1.2 and Spark NLP 3.2.2
Hit:1 https://cloud.r-project.org/bin/li

In [None]:
import sparknlp
spark = sparknlp.start()

print(f"Spark NLP version: {sparknlp.version()}")
print(f"Apache Spark version: {spark.version}")

Spark NLP version: 3.2.2
Apache Spark version: 3.1.2


## Load Data

In [None]:
! wget -N https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/sarcasm/train-balanced-sarcasm.csv -P /tmp

--2021-09-02 18:33:02--  https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/sarcasm/train-balanced-sarcasm.csv
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.217.38.254
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.217.38.254|:443... connected.
HTTP request sent, awaiting response... 304 Not Modified
File ‘/tmp/train-balanced-sarcasm.csv’ not modified on server. Omitting download.



In [None]:
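from pyspark.sql import SparkSession

# sparknlp.start() already created a session, so getOrCreate() returns that one.
# JVM-level settings such as spark.driver.memory are fixed once the driver has
# started and are not changed by this call.
spark = SparkSession \
    .builder \
    .appName("ML SQL session") \
    .config('spark.executor.instances','2') \
    .config("spark.executor.memory", "2g") \
    .config("spark.driver.memory","16g") \
    .getOrCreate()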

In [None]:
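from pyspark.sql import SQLContext

# SQLContext expects a SparkContext as its first argument. It is deprecated in
# Spark 3, where spark.sql(...) runs the same queries directly.
sql = SQLContext(spark.sparkContext, spark)
trainDF = spark.read.option("header", True).option("inferSchema", True).csv("/tmp/train-balanced-sarcasm.csv")
trainDF.printSchema()
# Register a temporary view so the data can be queried with SQL
trainDF.createOrReplaceTempView('Sarcasm')
sql.sql('SELECT COUNT(*) FROM Sarcasm').collect()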

root
 |-- label: integer (nullable = true)
 |-- comment: string (nullable = true)
 |-- author: string (nullable = true)
 |-- subreddit: string (nullable = true)
 |-- score: string (nullable = true)
 |-- ups: string (nullable = true)
 |-- downs: string (nullable = true)
 |-- date: string (nullable = true)
 |-- created_utc: string (nullable = true)
 |-- parent_comment: string (nullable = true)



[Row(count(1)=1010826)]

In [None]:
#Check the datatype
type(trainDF)

pyspark.sql.dataframe.DataFrame

In [None]:
sql.sql('select * from Sarcasm where author="Trumpbart"').show()

+-----+----------------+---------+---------+-----+---+-----+-------+-------------------+--------------------+
|label|         comment|   author|subreddit|score|ups|downs|   date|        created_utc|      parent_comment|
+-----+----------------+---------+---------+-----+---+-----+-------+-------------------+--------------------+
|    0|      NC and NH.|Trumpbart| politics|    2| -1|   -1|2016-10|2016-10-16 23:55:23|Yeah, I get that ...|
|    1|Very surprising!|Trumpbart| politics|    1| -1|   -1|2016-10|2016-10-26 04:35:59|t_d is defending ...|
+-----+----------------+---------+---------+-----+---+-----+-------+-------------------+--------------------+



In [None]:
sql.sql('select label,count(1) as label_count from Sarcasm group by label order by label_count desc').show()

+-----+-----------+
|label|label_count|
+-----+-----------+
|    1|     505413|
|    0|     505413|
+-----+-----------+



In [None]:
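# Join each parent comment to its reply with "|" and keep 10,000 rows where both are non-null.
# LIMIT without ORDER BY returns whichever rows Spark reads first, so the sample
# is not guaranteed to stay label-balanced.
df = sql.sql('select label,concat(parent_comment,"|",comment) as comment from Sarcasm where comment is not null and parent_comment is not null limit 10000')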
df.printSchema()
df.show()

root
 |-- label: integer (nullable = true)
 |-- comment: string (nullable = true)

+-----+--------------------+
|label|             comment|
+-----+--------------------+
|    0|Yeah, I get that ...|
|    0|The blazers and M...|
|    0|They're favored t...|
|    0|deadass don't kil...|
|    0|Yep can confirm I...|
|    0|do you find arian...|
|    0|What's your weird...|
|    0|Probably Sephirot...|
|    0|What to upgrade? ...|
|    0|Probably count Ka...|
|    0|I bet if that mon...|
|    0|James Shields Wil...|
|    0|There's no time t...|
|    0|Team Specific Thr...|
|    0|Ill give you a hi...|
|    0|Star Wars, easy. ...|
|    0|You're adorable.|...|
|    0|He actually acts ...|
|    0|Clinton struggles...|
|    0|Is that the Older...|
+-----+--------------------+
only showing top 20 rows
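
The first 20 rows of the `LIMIT 10000` sample above all carry label 0, which suggests the cut follows file order rather than the 50/50 balance of the full table. One way to keep labels balanced is to sample per label. The following is a minimal plain-Python sketch of that idea, not Spark code: `stratified_sample` and the toy rows are hypothetical names made up for illustration. In PySpark, `DataFrame.sampleBy` accepts per-label sampling fractions and may be a closer fit for the real data.

```python
import random
from collections import defaultdict

def stratified_sample(rows, label_of, per_label, seed=123):
    """Return up to `per_label` randomly chosen rows for every label value."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[label_of(row)].append(row)
    sample = []
    for label in sorted(by_label):
        group = by_label[label]
        sample.extend(rng.sample(group, min(per_label, len(group))))
    return sample

# Toy data (hypothetical): 70 rows with label 0 followed by 30 rows with label 1.
toy_rows = [(0, f"comment {i}") for i in range(70)] + \
           [(1, f"comment {i}") for i in range(70, 100)]

first_rows = toy_rows[:20]  # a LIMIT-style cut that follows storage order
balanced = stratified_sample(toy_rows, label_of=lambda r: r[0], per_label=10)

print("label-1 rows in first 20:", sum(1 for r in first_rows if r[0] == 1))
print("label-1 rows in stratified sample:", sum(1 for r in balanced if r[0] == 1))
```

On the toy data, the order-based cut contains no label-1 rows at all, while the stratified sample holds ten rows of each label.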



In [None]:
from sparknlp.annotator import *
from sparknlp.common import *
from sparknlp.base import *
from pyspark.ml import Pipeline


docs_assembler = DocumentAssembler() \
    .setInputCol("comment") \
    .setOutputCol("document")
    
sent_detector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence") \
    .setUseAbbreviations(True)
    
token = Tokenizer() \
  .setInputCols(["sentence"]) \
  .setOutputCol("token")

stem = Stemmer() \
    .setInputCols(["token"]) \
    .setOutputCol("stem")
    
norm = Normalizer() \
    .setInputCols(["stem"]) \
    .setOutputCol("normalized")

result = Finisher() \
    .setInputCols(["normalized"]) \
    .setOutputCols(["ntokens"]) \
    .setOutputAsArray(True) \
    .setCleanAnnotations(True)

nlp_pipe = Pipeline(stages=[docs_assembler, sent_detector, token, stem, norm,result])
nlp_model = nlp_pipe.fit(df)
preprocessing = nlp_model.transform(df).persist()
preprocessing.count()
preprocessing.show()


+-----+--------------------+--------------------+
|label|             comment|             ntokens|
+-----+--------------------+--------------------+
|    0|Yeah, I get that ...|[yeah, i, get, th...|
|    0|The blazers and M...|[the, blazer, and...|
|    0|They're favored t...|[theyr, favor, to...|
|    0|deadass don't kil...|[deadass, dont, k...|
|    0|Yep can confirm I...|[yep, can, confir...|
|    0|do you find arian...|[do, you, find, a...|
|    0|What's your weird...|[what, your, weir...|
|    0|Probably Sephirot...|[probabl, sephiro...|
|    0|What to upgrade? ...|[what, to, upgrad...|
|    0|Probably count Ka...|[probabl, count, ...|
|    0|I bet if that mon...|[i, bet, if, that...|
|    0|James Shields Wil...|[jame, shield, wi...|
|    0|There's no time t...|[there, no, time,...|
|    0|Team Specific Thr...|[team, specif, th...|
|    0|Ill give you a hi...|[ill, give, you, ...|
|    0|Star Wars, easy. ...|[star, war, easi,...|
|    0|You're adorable.|...|  [your, ador, note]|


In [None]:
train, test = preprocessing.randomSplit(weights=[0.7, 0.3], seed=123)
print(f"Length of train dataset {train.count()}")
print(f"Length of test dataset {test.count()}")

Length of train dataset 6934
Length of test dataset 3066
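
The split sizes above (6934 and 3066) are close to 70/30 but not exact, because `randomSplit` assigns each row to a split at random according to the weights instead of cutting at a fixed row count. The sketch below illustrates that behaviour in plain Python. It is an assumption-level illustration, not Spark's implementation, and `weighted_random_split` is a hypothetical helper.

```python
import random

def weighted_random_split(rows, weights, seed=123):
    """Assign every row to one split by drawing a uniform number per row."""
    total = sum(weights)
    bounds, acc = [], 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)      # cumulative upper bound of each split
    bounds[-1] = 1.0            # guard against floating-point rounding
    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()     # uniform in [0, 1)
        for i, bound in enumerate(bounds):
            if draw < bound:
                splits[i].append(row)
                break
    return splits

rows = list(range(10000))
train_rows, test_rows = weighted_random_split(rows, [0.7, 0.3], seed=123)
print(f"train: {len(train_rows)}, test: {len(test_rows)}")
```

The two sizes always add up to the number of input rows, while the exact ratio varies a little with the seed.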


In [None]:
from pyspark.ml import feature as spark_ft

stopWords = spark_ft.StopWordsRemover.loadDefaultStopWords('english')
sw_remover = spark_ft.StopWordsRemover(inputCol='ntokens', outputCol='clean_tokens', stopWords=stopWords)
tf = spark_ft.CountVectorizer(vocabSize=500, inputCol='clean_tokens', outputCol='tf')
idf = spark_ft.IDF(minDocFreq=5, inputCol='tf', outputCol='idf')

feature_pipeline = Pipeline(stages=[sw_remover, tf, idf])
feature_model = feature_pipeline.fit(train)

train_featurized = feature_model.transform(train).persist()
train_featurized.count()
train_featurized.show()

+-----+--------------------+--------------------+--------------------+--------------------+--------------------+
|label|             comment|             ntokens|        clean_tokens|                  tf|                 idf|
+-----+--------------------+--------------------+--------------------+--------------------+--------------------+
|    0|"""...newtons."" ...|[newton, which, i...|[newton, dont, ge...|(500,[0,6,165],[1...|(500,[0,6,165],[1...|
|    0|"""Agreed. I thin...|[agr, i, think, w...|[agr, think, issu...|(500,[0,1,7,13,31...|(500,[0,1,7,13,31...|
|    0|"""And an underag...|[and, an, underag...|[underag, girlthi...|         (500,[],[])|         (500,[],[])|
|    0|"""As an excuse""...|[a, an, excus, so...|[excus, sorri, bu...|(500,[18,163,266]...|(500,[18,163,266]...|
|    0|"""Budaya Ketimur...|[budaya, ketimura...|[budaya, ketimura...|(500,[0,1,7,35,15...|(500,[0,1,7,35,15...|
|    0|"""Dear so called...|[dear, so, call, ...|[dear, call, gran...|    (500,[71],[2.0])|(500,
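
In the table above, the `tf` column holds raw term counts over the 500-term vocabulary and the `idf` column holds the same counts rescaled by inverse document frequency. Spark's documentation gives the weight as log((m + 1) / (df(t) + 1)), where m is the number of documents and df(t) the number of documents containing term t. Terms seen in fewer than `minDocFreq` documents get weight 0. The sketch below reproduces that arithmetic in plain Python on a toy corpus. The function names and the corpus are hypothetical, and vocabulary pruning (`vocabSize`) is left out.

```python
import math
from collections import Counter

def fit_idf(docs, min_doc_freq=0):
    """Compute log((m + 1) / (df + 1)) per term; rare terms get weight 0."""
    m = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per document
    return {
        term: math.log((m + 1) / (df + 1)) if df >= min_doc_freq else 0.0
        for term, df in doc_freq.items()
    }

def tf_idf(tokens, idf):
    """Multiply raw term counts by the fitted IDF weights."""
    counts = Counter(tokens)
    return {term: count * idf.get(term, 0.0) for term, count in counts.items()}

# Toy corpus (hypothetical) of already tokenized documents.
docs = [
    ["sarcasm", "great", "great"],
    ["great", "idea"],
    ["sarcasm", "detector"],
]
idf = fit_idf(docs, min_doc_freq=2)
weights = tf_idf(docs[0], idf)
print(weights)
```

Here "great" and "sarcasm" each occur in two of three documents, so both get log(4/3), and "great" is doubled in the first document because it appears twice. "idea" and "detector" fall below `min_doc_freq=2` and receive weight 0.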

In [None]:
train_featurized.groupBy("label").count().show()
train_featurized.printSchema()

+-----+-----+
|label|count|
+-----+-----+
|    0| 4357|
|    1| 2577|
+-----+-----+

root
 |-- label: integer (nullable = true)
 |-- comment: string (nullable = true)
 |-- ntokens: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- clean_tokens: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- tf: vector (nullable = true)
 |-- idf: vector (nullable = true)



In [None]:
from pyspark.ml import classification as spark_cls

# Gradient-boosted trees classifier trained on the TF-IDF features
gbt = spark_cls.GBTClassifier(labelCol="label", featuresCol="idf")
model = gbt.fit(train_featurized)

In [None]:
test_featurized = feature_model.transform(test)
preds = model.transform(test_featurized)
preds.show()

+-----+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+----------+
|label|             comment|             ntokens|        clean_tokens|                  tf|                 idf|       rawPrediction|         probability|prediction|
+-----+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+----------+
|    0|"""An unmet playe...|[an, unmet, playe...|[unmet, player, h...|(500,[0,1,7,9,12,...|(500,[0,1,7,9,12,...|[0.29389294149927...|[0.64285696333844...|       0.0|
|    0|"""Christ Enthusi...|[christ, enthusia...|[christ, enthusia...|(500,[3,7,46,56,5...|(500,[3,7,46,56,5...|[0.25503848824496...|[0.62482451938123...|       0.0|
|    0|"""Did Hillary Cl...|[did, hillari, cl...|[hillari, clinton...|(500,[19,106,198,...|(500,[19,106,198,...|[0.26936459449257...|[0.63151674521756...|       0.0|
|   

In [None]:
from sklearn.metrics import accuracy_score

# Bring only the two columns needed for scoring to the driver
preds = preds.select("label", "prediction").toPandas()
result = accuracy_score(preds["label"], preds["prediction"])
print(f"Accuracy of the Classifier {round(result*100,2)}")

Accuracy of the Classifier 64.42
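
An accuracy figure is easier to judge next to a majority-class baseline, that is, the accuracy of always predicting the most frequent label. The notebook does not print the label counts of the test split, so the sketch below uses the training-split counts shown earlier (4357 and 2577) as a stand-in. Treat the result as a rough reference point under that assumption, not as a measured test baseline.

```python
# Label counts copied from the groupBy("label").count() output of the training split.
train_label_counts = {0: 4357, 1: 2577}
reported_accuracy = 64.42  # accuracy printed by the cell above, in percent

total = sum(train_label_counts.values())
majority_label = max(train_label_counts, key=train_label_counts.get)
baseline = 100 * train_label_counts[majority_label] / total

print(f"Majority label: {majority_label}")
print(f"Majority-class baseline: {baseline:.2f}%")
print(f"Margin over baseline: {reported_accuracy - baseline:.2f} points")
```

If the test split has a similar label mix, the classifier sits only a point or two above this baseline, which is worth keeping in mind when comparing against other reported numbers.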


# Text-to-Text Transfer Transformer (T5)

### Text Summarization using T5 Transformer

In [None]:
docs_assemblers = DocumentAssembler()\
                  .setInputCol("text")\
                  .setOutputCol("documents")

# pretrained() is a class-level loader, so it is called on T5Transformer directly
t5_transformer = T5Transformer.pretrained("t5_small","en")\
                 .setTask("summarize")\
                 .setMaxOutputLength(1000)\
                 .setInputCols(["documents"])\
                 .setOutputCol("summaries")

result = Pipeline(stages=[docs_assemblers,t5_transformer])
                    

t5_small download started this may take some time.
Approximate size to download 139 MB
[OK!]


In [None]:
df = spark.createDataFrame([[""]]).toDF("text")
df.show()

+----+
|text|
+----+
|    |
+----+



In [None]:
model = result.fit(df)
model2 = LightPipeline(model)

In [None]:
wiki_corpora_AI = """
Artificial intelligence (AI) is intelligence demonstrated by machines, as opposed to the natural intelligence displayed by humans or animals. Leading AI textbooks define the field as the study of "intelligent agents": any system that perceives its environment and takes actions that maximize its chance of achieving its goals.Some popular accounts use the term "artificial intelligence" to describe machines that mimic "cognitive" functions that humans associate with the human mind, such as "learning" and "problem solving", however this definition is rejected by major AI researchers.

AI applications include advanced web search engines, recommendation systems (used by YouTube, Amazon and Netflix), understanding human speech (such as Siri or Alexa), self-driving cars (e.g. Tesla), and competing at the highest level in strategic game systems (such as chess and Go),[2] As machines become increasingly capable, tasks considered to require "intelligence" are often removed from the definition of AI, a phenomenon known as the AI effect. For instance, optical character recognition is frequently excluded from things considered to be AI, having become a routine technology.

Artificial intelligence was founded as an academic discipline in 1956, and in the years since has experienced several waves of optimism, followed by disappointment and the loss of funding (known as an "AI winter"), followed by new approaches, success and renewed funding.[7][10] AI research has tried and discarded many different approaches during its lifetime, including simulating the brain, modeling human problem solving, formal logic, large databases of knowledge and imitating animal behavior. In the first decades of the 21st century, highly mathematical statistical machine learning has dominated the field, and this technique has proved highly successful, helping to solve many challenging problems throughout industry and academia.
"""

In [None]:
result = model2.fullAnnotate(wiki_corpora_AI)[0]
result

{'documents': [Annotation(document, 0, 1920, 
  Artificial intelligence (AI) is intelligence demonstrated by machines, as opposed to the natural intelligence displayed by humans or animals. Leading AI textbooks define the field as the study of "intelligent agents": any system that perceives its environment and takes actions that maximize its chance of achieving its goals.Some popular accounts use the term "artificial intelligence" to describe machines that mimic "cognitive" functions that humans associate with the human mind, such as "learning" and "problem solving", however this definition is rejected by major AI researchers.
  
  AI applications include advanced web search engines, recommendation systems (used by YouTube, Amazon and Netflix), understanding human speech (such as Siri or Alexa), self-driving cars (e.g. Tesla), and competing at the highest level in strategic game systems (such as chess and Go),[2] As machines become increasingly capable, tasks considered to require "int

In [None]:
print("Summary :",result["summaries"][0].result)

Summary : leading AI textbooks define the field as the study of "intelligent agents" the term "artificial intelligence" is used to describe machines that mimic "cognitive" functions that humans associate with the human mind . the definition of AI is rejected by major AI researchers .


## Text Summarization using Summa

In [None]:
!pip install summa

Collecting summa
  Downloading summa-1.2.0.tar.gz (54 kB)
Building wheels for collected packages: summa
  Building wheel for summa (setup.py) ... done
  Created wheel for summa: filename=summa-1.2.0-py3-none-any.whl size=54410 sha256=14de373437f96576ae1a564727330a149e386eabeba6291b6d7535f7abfc50d4
  Stored in directory: /root/.cache/pip/wheels/29/64/ac/7b443477588d365ef37ada30d456bdf5f07dc5be9f6324cb6e
Successfully built summa
Installing collected packages: summa
Successfully installed summa-1.2.0


In [None]:
from summa.summarizer import summarize
print(summarize(wiki_corpora_AI, ratio=0.3))

Leading AI textbooks define the field as the study of "intelligent agents": any system that perceives its environment and takes actions that maximize its chance of achieving its goals.Some popular accounts use the term "artificial intelligence" to describe machines that mimic "cognitive" functions that humans associate with the human mind, such as "learning" and "problem solving", however this definition is rejected by major AI researchers.
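
Unlike T5, which generates new wording, summa is an extractive summarizer based on TextRank: it ranks the original sentences and returns the top ones verbatim, which is why the output above is copied straight from the input. The sketch below shows the general TextRank idea in plain Python. It is a simplified illustration under several assumptions (a naive regex sentence splitter, distinct-word overlap as similarity, a fixed number of iterations) and is not summa's actual code. All function names are hypothetical.

```python
import math
import re

def split_sentences(text):
    """Naive splitter: break on whitespace that follows '.', '!' or '?'."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]

def similarity(words_a, words_b):
    """Shared distinct words, normalized by the log of both sentence sizes."""
    if len(words_a) < 2 or len(words_b) < 2:
        return 0.0
    return len(words_a & words_b) / (math.log(len(words_a)) + math.log(len(words_b)))

def textrank_sentences(text, top_n=1, damping=0.85, iterations=50):
    """Rank sentences with a PageRank-style iteration over a similarity graph."""
    sentences = split_sentences(text)
    words = [set(re.findall(r"[a-z']+", s.lower())) for s in sentences]
    n = len(sentences)
    weights = [[similarity(words[i], words[j]) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0
            )
            for i in range(n)
        ]
    best = sorted(range(n), key=lambda i: scores[i], reverse=True)[:top_n]
    return [sentences[i] for i in sorted(best)]  # keep original sentence order

toy_text = (
    "Spark NLP runs on Apache Spark. "
    "Spark NLP offers pretrained pipelines for Apache Spark users. "
    "Bananas are yellow."
)
print(textrank_sentences(toy_text, top_n=2))
```

On the toy text, the two sentences that share vocabulary reinforce each other and are selected, while the unrelated third sentence receives only the base score and is dropped.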


## References

1. https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/T5TRANSFORMER.ipynb

2. https://towardsdatascience.com/hands-on-googles-text-to-text-transfer-transformer-t5-with-spark-nlp-6f7db75cecff

3. https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/10.T5_Workshop_with_Spark_NLP.ipynb

4. https://en.wikipedia.org/wiki/Artificial_intelligence

