Question 2: Spark NLP

In this task you are required to use the Spark NLP library to implement a text classification model.
Use the AG News dataset (see reference 1 for how to download it), train for 5 epochs, and compare
the performance of the following models:

a) Use BERT embeddings with a generic annotator model in SparkNLP called ClassifierDL, without
any text preprocessing steps and find the test accuracy for it.

b) Add preprocessing steps, specifically lemmatization and stop word removal, to the pipeline in (a)
and compare its impact on the overall performance of the model. Report the test accuracies when
each step is implemented individually and when they are used together. Identify the pipeline that
yields the highest test accuracy and give a brief explanation of why it performs the best.

c) Lastly, select the best pipeline from (a) and (b) and use RoBERTa embeddings instead of BERT
embeddings. Report which embedding gives the best results and why.

You can use Google Colab to do this task. Use the following links for reference:
1) https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/5.1_Text_classification_examples_in_SparkML_SparkNLP.ipynb
2) https://towardsdatascience.com/text-classification-in-spark-nlp-with-bert-and-universal-sentence-encoders-e644d618ca32
3) https://nlp.johnsnowlabs.com/docs/en/quickstart
4) https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/tutorials/Certification_Trainings/Public

#Set up SparkNLP

In [None]:
!wget https://setup.johnsnowlabs.com/colab.sh -O - | bash /dev/stdin -p 3.3.0 -s 4.3.2 -g

In [None]:
import sparknlp

from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
import pandas as pd
import os

spark = sparknlp.start(gpu = True)  # gpu = True needs a GPU runtime; call sparknlp.start() to run on CPU

print("Spark NLP version:", sparknlp.version())
print("Apache Spark version:", spark.version)

spark

#Import data and set up train, test set

In [None]:
! wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Public/data/news_category_train.csv
! wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Public/data/news_category_test.csv

In [None]:
from pyspark.sql.functions import col

trainDataset = spark.read \
      .option("header", True) \
      .csv("news_category_train.csv")

trainDataset.groupBy("category") \
    .count() \
    .orderBy(col("count").desc()) \
    .show()

testDataset = spark.read \
      .option("header", True) \
      .csv("news_category_test.csv")


testDataset.groupBy("category") \
      .count() \
      .orderBy(col("count").desc()) \
      .show()

#a, Use BERT embeddings and ClassifierDL on raw data

Install the transformers and TensorFlow packages


In [None]:
!pip install -q transformers==4.15.0 tensorflow==2.11.0

In [None]:
# DocumentAssembler is the entry point of every Spark NLP pipeline: it wraps the raw text column as a document annotation
document = DocumentAssembler()\
    .setInputCol("description")\
    .setOutputCol("document")

bert_sent = BertSentenceEmbeddings.pretrained('sent_small_bert_L8_512')\
    .setInputCols(["document"])\
    .setOutputCol("sentence_embeddings")

# the classes/labels/categories are in category column
classsifierdl = ClassifierDLApproach()\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("class")\
    .setLabelColumn("category")\
    .setMaxEpochs(5)\
    .setEnableOutputLogs(True)\
    .setLr(0.001)\
    .setBatchSize(1)

bert_sent_clf_pipeline = Pipeline(stages = [document,
                                            bert_sent,
                                            classsifierdl])

In [None]:
%%time
bert_sent_pipelineModel = bert_sent_clf_pipeline.fit(trainDataset)

In [None]:
from sklearn.metrics import classification_report

preds = bert_sent_pipelineModel.transform(testDataset)

preds_df = preds.select('category','description',"class.result").toPandas()

preds_df['result'] = preds_df['result'].apply(lambda x : x[0])

print (classification_report(preds_df['category'], preds_df['result']))

BERT embeddings with the generic ClassifierDL annotator give a test accuracy of 89% without any preprocessing.
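The accuracy reported above is the fraction of test rows whose predicted label matches the `category` column. A minimal sketch of that calculation in plain Python follows; the function name `accuracy` is hypothetical and the labels are toy values, not the notebook's predictions.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy of an empty label list")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Toy labels for illustration only; these are not results from the notebook.
toy_true = ["World", "Sports", "Business", "Sci/Tech"]
toy_pred = ["World", "Sports", "Sci/Tech", "Sci/Tech"]
print(accuracy(toy_true, toy_pred))  # 0.75
```

With the notebook's DataFrame this would presumably be called as `accuracy(list(preds_df['category']), list(preds_df['result']))`.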


#b, add pre-processing (lemmatization and stop-word removal) to a) individually and together

Set up the preprocessing stages (tokenization, etc.)

* Add stop word removal

In [None]:
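document = DocumentAssembler()\
    .setInputCol("description")\
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

stopwords_cleaner = StopWordsCleaner()\
    .setInputCols("token")\
    .setOutputCol("cleanTokens")\
    .setCaseSensitive(False)

# Rebuild a document from the cleaned tokens; without this stage the embeddings
# are still computed on the raw "document" column and the cleaner has no effect
token_assembler = TokenAssembler()\
    .setInputCols(["document", "cleanTokens"])\
    .setOutputCol("clean_document")

# Redefine the embedder so it reads the cleaned text instead of the raw text
bert_sent = BertSentenceEmbeddings.pretrained('sent_small_bert_L8_512')\
    .setInputCols(["clean_document"])\
    .setOutputCol("sentence_embeddings")

stop_word_pipeline = Pipeline(stages = [document,
                                        tokenizer,
                                        stopwords_cleaner,
                                        token_assembler,
                                        bert_sent,
                                        classsifierdl])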



In [None]:
%%time
stop_word_pipeline_Model = stop_word_pipeline.fit(trainDataset)

In [None]:
preds = stop_word_pipeline_Model.transform(testDataset)

preds_df = preds.select('category','description',"class.result").toPandas()

preds_df['result'] = preds_df['result'].apply(lambda x : x[0])

print (classification_report(preds_df['category'], preds_df['result']))
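Stop-word removal drops very frequent function words from the token list before the text is embedded. A minimal sketch of the idea in plain Python follows; `TOY_STOP_WORDS` and `remove_stop_words` are hypothetical names, and the word list is a tiny illustrative assumption rather than the list that `StopWordsCleaner` actually uses.

```python
# Tiny illustrative stop-word list (an assumption, not the StopWordsCleaner default).
TOY_STOP_WORDS = {"the", "a", "an", "of", "in", "on", "to", "is", "and"}


def remove_stop_words(tokens, stop_words=TOY_STOP_WORDS, case_sensitive=False):
    """Drop tokens that appear in stop_words, optionally ignoring case."""
    kept = []
    for tok in tokens:
        key = tok if case_sensitive else tok.lower()
        if key not in stop_words:
            kept.append(tok)
    return kept


tokens = "The price of oil is rising in Asia".split()
print(remove_stop_words(tokens))  # ['price', 'oil', 'rising', 'Asia']
```

The `case_sensitive` flag mirrors the role of `setCaseSensitive(False)` in the pipeline: with case-insensitive matching, the capitalised "The" is removed as well.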

* Add lemmatization

In [None]:
documentAssembler = DocumentAssembler()\
    .setInputCol("description")\
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

lemma = LemmatizerModel.pretrained('lemma_antbnc') \
    .setInputCols(["token"]) \
    .setOutputCol("lemma")

# Rebuild a document from the lemmas; without this stage the embeddings
# are still computed on the raw "document" column and the lemmatizer has no effect
token_assembler = TokenAssembler()\
    .setInputCols(["document", "lemma"])\
    .setOutputCol("lemma_document")

bert_sent = BertSentenceEmbeddings.pretrained('sent_small_bert_L8_512')\
    .setInputCols(["lemma_document"])\
    .setOutputCol("sentence_embeddings")

classsifierdl = ClassifierDLApproach()\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("class")\
    .setLabelColumn("category")\
    .setMaxEpochs(5)\
    .setEnableOutputLogs(True)\
    .setLr(0.001)

lemma_pipeline = Pipeline(stages = [documentAssembler,
                                    tokenizer,
                                    lemma,
                                    token_assembler,
                                    bert_sent,
                                    classsifierdl])

In [None]:
%%time
lemma_pipeline_Model = lemma_pipeline.fit(trainDataset)

In [None]:
preds = lemma_pipeline_Model.transform(testDataset)

preds_df = preds.select('category','description',"class.result").toPandas()

preds_df['result'] = preds_df['result'].apply(lambda x : x[0])

print (classification_report(preds_df['category'], preds_df['result']))
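Lemmatization maps each inflected token to its dictionary form. A minimal sketch with a hand-written lookup table follows; `TOY_LEMMAS` and `lemmatize` are hypothetical and only illustrate the idea, they do not reproduce the contents of `lemma_antbnc`.

```python
# Hand-written lookup table for illustration only (an assumption).
TOY_LEMMAS = {
    "prices": "price",
    "rising": "rise",
    "rose": "rise",
    "markets": "market",
    "was": "be",
}


def lemmatize(tokens, lemmas=TOY_LEMMAS):
    """Replace each token by its lemma if known; unknown tokens pass through unchanged."""
    return [lemmas.get(tok.lower(), tok) for tok in tokens]


print(lemmatize(["Prices", "rose", "in", "Asian", "markets"]))
# ['price', 'rise', 'in', 'Asian', 'market']
```

Tokens missing from the table, such as "Asian" here, are left untouched, so the output always has the same length as the input.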

* Add both stop-word removal and lemmatization to the BERT pipeline

In [None]:
documentAssembler = DocumentAssembler()\
    .setInputCol("description")\
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

lemma = LemmatizerModel.pretrained('lemma_antbnc') \
    .setInputCols(["token"]) \
    .setOutputCol("lemma")

# Chain the cleaner after the lemmatizer so that both steps are applied to the same tokens
stopwords_cleaner = StopWordsCleaner()\
    .setInputCols("lemma")\
    .setOutputCol("cleanTokens")\
    .setCaseSensitive(False)

# Rebuild a document from the preprocessed tokens so the embeddings actually see them
token_assembler = TokenAssembler()\
    .setInputCols(["document", "cleanTokens"])\
    .setOutputCol("clean_document")

bert_sent = BertSentenceEmbeddings.pretrained('sent_small_bert_L8_512')\
    .setInputCols(["clean_document"])\
    .setOutputCol("sentence_embeddings")

classsifierdl = ClassifierDLApproach()\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("class")\
    .setLabelColumn("category")\
    .setMaxEpochs(5)\
    .setEnableOutputLogs(True)\
    .setLr(0.001)

combined_pipeline = Pipeline(stages = [documentAssembler,
                                      tokenizer,
                                      lemma,
                                      stopwords_cleaner,
                                      token_assembler,
                                      bert_sent,
                                      classsifierdl])

In [None]:
%%time
combined_pipeline_Model = combined_pipeline.fit(trainDataset)

In [None]:
preds = combined_pipeline_Model.transform(testDataset)

preds_df = preds.select('category','description',"class.result").toPandas()

preds_df['result'] = preds_df['result'].apply(lambda x : x[0])

print (classification_report(preds_df['category'], preds_df['result']))
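Part (b) asks for the pipeline with the highest test accuracy, which means putting the four sets of predictions side by side. A minimal sketch of such a comparison in plain Python follows; `rank_pipelines` is a hypothetical helper, and the pipeline names and labels are toy placeholders, not results from this notebook.

```python
def rank_pipelines(predictions):
    """Return (name, accuracy) pairs sorted from highest to lowest accuracy.

    predictions maps a pipeline name to a (true_labels, predicted_labels) pair.
    """
    scores = []
    for name, (y_true, y_pred) in predictions.items():
        correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
        scores.append((name, correct / len(y_true)))
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Toy labels for illustration only; substitute the saved predictions of each pipeline.
truth = ["World", "Sports", "Business", "Sci/Tech"]
toy_predictions = {
    "pipeline_1": (truth, ["World", "Sports", "Business", "World"]),
    "pipeline_2": (truth, ["World", "World", "Business", "World"]),
}
for name, acc in rank_pipelines(toy_predictions):
    print(f"{name:<12} {acc:.2%}")
```

In the notebook, `preds_df` is overwritten by every evaluation cell, so each result would need to be stored under its own name before a comparison like this can be run.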

#c, use RoBERTa embeddings in the best pipeline from a) and b)

In [None]:
! wget -q http://setup.johnsnowlabs.com/colab.sh -O - | bash

Run the command below if the RoBERTa pipeline gets stuck at the stop-words cleaner stage

In [None]:
!wget https://setup.johnsnowlabs.com/colab.sh -O - | bash /dev/stdin -p 3.3.0 -s 4.3.2 -g

In [None]:
document = DocumentAssembler()\
    .setInputCol("description")\
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

lemma = LemmatizerModel.pretrained('lemma_antbnc') \
    .setInputCols(["token"]) \
    .setOutputCol("lemma")

# Chain the cleaner after the lemmatizer so that both preprocessing steps are applied
stopwords_cleaner = StopWordsCleaner()\
    .setInputCols("lemma")\
    .setOutputCol("cleanTokens")\
    .setCaseSensitive(False)

# RoBertaEmbeddings produces one vector per token, so it reads the cleaned tokens directly.
# Alternative model: RoBertaEmbeddings.pretrained("roberta_embeddings_distilroberta_base", "en")
roberta_embeddings = RoBertaEmbeddings.pretrained("roberta_base", "en") \
    .setInputCols(["document", "cleanTokens"]) \
    .setOutputCol("embeddings")

# ClassifierDL requires sentence-level embeddings: average the token vectors of each document
sentence_embeddings = SentenceEmbeddings() \
    .setInputCols(["document", "embeddings"]) \
    .setOutputCol("sentence_embeddings") \
    .setPoolingStrategy("AVERAGE")

classsifierdl = ClassifierDLApproach()\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("class")\
    .setLabelColumn("category")\
    .setMaxEpochs(5)\
    .setEnableOutputLogs(True)\
    .setLr(0.001)\
    .setBatchSize(1)

roberta_pipeline = Pipeline(stages = [document,
                                      tokenizer,
                                      lemma,
                                      stopwords_cleaner,
                                      roberta_embeddings,
                                      sentence_embeddings,
                                      classsifierdl])

In [None]:
%%time
roberta_pipeline_Model = roberta_pipeline.fit(trainDataset)

In [None]:
preds = roberta_pipeline_Model.transform(testDataset)

preds_df = preds.select('category','description',"class.result").toPandas()

preds_df['result'] = preds_df['result'].apply(lambda x : x[0])

print (classification_report(preds_df['category'], preds_df['result']))
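`ClassifierDLApproach` takes sentence-level embeddings as input, while `RoBertaEmbeddings` emits one vector per token, so the token vectors have to be reduced to a single vector per document. Average pooling is one common way to do that. A minimal sketch in plain Python follows; `average_pool` is a hypothetical helper and the vectors are toy values.

```python
def average_pool(token_vectors):
    """Average a list of equal-length token vectors into one sentence vector."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    dim = len(token_vectors[0])
    if any(len(vec) != dim for vec in token_vectors):
        raise ValueError("all token vectors must have the same dimension")
    return [sum(vec[i] for vec in token_vectors) / len(token_vectors) for i in range(dim)]


# Toy 3-dimensional vectors for a 2-token sentence (RoBERTa base vectors have 768 dimensions).
toy_tokens = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]
print(average_pool(toy_tokens))  # [2.0, 3.0, 4.0]
```

Averaging keeps the output dimension equal to the token dimension regardless of sentence length, which is what lets a fixed-size classifier consume documents of any length.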