![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/15.BertForSequenceClassification_In_Spark_NLP.ipynb)

# BertForSequenceClassification

`BertForSequenceClassification` loads BERT models with a sequence classification/regression head on top (a linear layer over the pooled output), e.g. for multi-class document classification tasks.
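
To make "a linear layer over the pooled output" concrete, here is a minimal pure-Python sketch of such a head. It is not Spark NLP's implementation: the weights, bias, and the 3-dimensional pooled vector are toy values invented for illustration, and the function names `linear` and `softmax` are hypothetical helpers.

```python
import math

def linear(pooled, weights, bias):
    """Compute one logit per class: logits[c] = dot(weights[c], pooled) + bias[c]."""
    return [sum(w * x for w, x in zip(row, pooled)) + b
            for row, b in zip(weights, bias)]

def softmax(logits):
    """Turn logits into probabilities that sum to 1 (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy values (assumptions); a real BERT pooled output has hundreds of dimensions.
labels = ["neg", "pos"]
pooled_output = [0.5, -1.0, 2.0]
weights = [[0.1, 0.4, -0.3],   # row producing the "neg" logit
           [-0.2, 0.1, 0.6]]   # row producing the "pos" logit
bias = [0.0, 0.1]

logits = linear(pooled_output, weights, bias)
probs = softmax(logits)
prediction = labels[probs.index(max(probs))]
print(prediction, probs)
```

The predicted class is simply the label with the highest probability; for a regression head the softmax step would be dropped and the raw output used directly.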

Pretrained models can be loaded with the `pretrained` method of the companion object.

## Colab Setup

In [None]:
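# Pin spark-nlp to the version this notebook was run with (3.4.0, see the output below);
# the `spark32=True` argument used in the next cell is specific to the 3.x releases.
! pip install -q pyspark==3.2.0 spark-nlp==3.4.0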

In [None]:
import sparknlp

spark = sparknlp.start(spark32=True)

from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline,PipelineModel
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
import pandas as pd

print("Spark NLP version", sparknlp.version())
print("Apache Spark version:", spark.version)

spark

Spark NLP version 3.4.0
Apache Spark version: 3.2.0


**BertForSequenceClassification Models**


*   `bert_sequence_classifier_toxicity`
*   `bert_sequence_classifier_sentiment`
*   `bert_base_sequence_classifier_imdb`
*   `bert_large_sequence_classifier_imdb`
*   `bert_multilingual_sequence_classifier_allocine`
*   `bert_base_sequence_classifier_ag_news`
*   `bert_base_sequence_classifier_dbpedia_14`


**AlbertForSequenceClassification Models**


*   `albert_base_sequence_classifier_imdb`
*   `albert_base_sequence_classifier_ag_news`

**DistilBertForSequenceClassification Models**



*   `distilbert_base_sequence_classifier_ag_news`
*   `distilbert_base_sequence_classifier_amazon_polarity`
*   `distilbert_base_sequence_classifier_imdb`
*   `distilbert_multilingual_sequence_classifier_allocine`
*   `distilbert_sequence_classifier_banking77`
*   `distilbert_sequence_classifier_emotion`
*   `distilbert_sequence_classifier_industry`
*   `distilbert_sequence_classifier_policy`
*   `distilbert_sequence_classifier_sst2`



**RoBertaForSequenceClassification Models**


*   `roberta_base_sequence_classifier_imdb`
*   `roberta_base_sequence_classifier_ag_news`

**XlmRoBertaForSequenceClassification Models**



*   `xlm_roberta_base_sequence_classifier_allocine`
*   `xlm_roberta_base_sequence_classifier_imdb`
*   `xlm_roberta_base_sequence_classifier_ag_news`


**XlnetForSequenceClassification Models**



*   `xlnet_base_sequence_classifier_imdb`
*   `xlnet_base_sequence_classifier_ag_news`


You can find all of these models, and more, in the [Spark NLP Models Hub](https://nlp.johnsnowlabs.com/models?task=Text+Classification).


## BertForSequenceClassification Pipeline

Now, let's create a Spark NLP pipeline with the `bert_base_sequence_classifier_imdb` model and check the results.

This is a fine-tuned BERT model that is ready to use for sequence classification tasks such as sentiment analysis or multi-class text classification. It is reported to achieve state-of-the-art performance.

The model predicts one of two classes: negative (`neg`) or positive (`pos`).



In [None]:
document_assembler = DocumentAssembler() \
    .setInputCol('text') \
    .setOutputCol('document')

tokenizer = Tokenizer() \
    .setInputCols(['document']) \
    .setOutputCol('token')

sequenceClassifier = BertForSequenceClassification \
    .pretrained('bert_base_sequence_classifier_imdb', 'en') \
    .setInputCols(['token', 'document']) \
    .setOutputCol('pred_class') \
    .setCaseSensitive(True) \
    .setMaxSentenceLength(512)

pipeline = Pipeline(stages=[
    document_assembler,
    tokenizer,
    sequenceClassifier
])

sample_text= [["I really liked that movie!"], ["The last movie I watched was awful!"]]
sample_df= spark.createDataFrame(sample_text).toDF("text")
model = pipeline.fit(sample_df)
result= model.transform(sample_df)

bert_base_sequence_classifier_imdb download started this may take some time.
Approximate size to download 387.6 MB
[OK!]


In [None]:
model.stages

[DocumentAssembler_3f66cce37f81,
 REGEX_TOKENIZER_92254d4a296b,
 BERT_FOR_SEQUENCE_CLASSIFICATION_41f87e548530]

We can check the classes of the `bert_base_sequence_classifier_imdb` model with the `getClasses()` method.

In [None]:
sequenceClassifier.getClasses()

['neg', 'pos']

In [None]:
result.columns

['text', 'document', 'token', 'pred_class']

In [None]:
result.printSchema()

root
 |-- text: string (nullable = true)
 |-- document: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)
 |-- token: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 

In [None]:
result_df= result.select(F.explode(F.arrays_zip(result.document.result, result.pred_class.result)).alias("col"))\
                 .select(F.expr("col['0']").alias("sentence"),
                         F.expr("col['1']").alias("prediction"))
                  
result_df.show(truncate=False)

+-----------------------------------+----------+
|sentence                           |prediction|
+-----------------------------------+----------+
|I really liked that movie!         |pos       |
|The last movie I watched was awful!|neg       |
+-----------------------------------+----------+
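
The `arrays_zip` + `explode` step above can be hard to read at first. The following plain-Python analogy shows what it does to each row; the `rows` list and its key names are hypothetical stand-ins for the `document.result` and `pred_class.result` array columns, not real Spark objects.

```python
# Hypothetical rows mimicking the two array columns selected above.
rows = [
    {"document_result": ["I really liked that movie!"],
     "pred_class_result": ["pos"]},
    {"document_result": ["The last movie I watched was awful!"],
     "pred_class_result": ["neg"]},
]

flat = []
for row in rows:
    # arrays_zip: pair the elements found at the same index of both arrays.
    # (Spark pads the shorter array with nulls; Python's zip truncates instead.)
    zipped = zip(row["document_result"], row["pred_class_result"])
    # explode: emit one output row per zipped pair.
    for sentence, prediction in zipped:
        flat.append({"sentence": sentence, "prediction": prediction})

for r in flat:
    print(r)
```

Because each input text yields exactly one document and one prediction here, the result has one row per input, matching the table above.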



## BertForSequenceClassification By Using LightPipeline

Now, we will use our model with LightPipeline.

In [None]:
from sparknlp.base import LightPipeline

In [None]:
document_assembler = DocumentAssembler() \
    .setInputCol('text') \
    .setOutputCol('document')

tokenizer = Tokenizer() \
    .setInputCols(['document']) \
    .setOutputCol('token')

sequenceClassifier = BertForSequenceClassification \
    .pretrained('bert_base_sequence_classifier_imdb', 'en') \
    .setInputCols(['token', 'document']) \
    .setOutputCol('pred_class') \
    .setCaseSensitive(True) \
    .setMaxSentenceLength(512)

pipeline = Pipeline(stages=[
    document_assembler,
    tokenizer,
    sequenceClassifier
])

empty_df = spark.createDataFrame([['']]).toDF("text")
model = pipeline.fit(empty_df)

bert_base_sequence_classifier_imdb download started this may take some time.
Approximate size to download 387.6 MB
[OK!]


Now that the model is built, let's wrap it in a `LightPipeline` and call `fullAnnotate` on some sample text.

In [None]:
light_model= LightPipeline(model)
light_result= light_model.fullAnnotate("It has an awesome ending, I wish I had watched that movie earlier.")[0]

In [None]:
light_result

{'document': [Annotation(document, 0, 65, It has an awesome ending, I wish I had watched that movie earlier., {})],
 'pred_class': [Annotation(category, 0, 65, pos, {'sentence': '0', 'Some(neg)': '0.021220155', 'Some(pos)': '0.97877985'})],
 'token': [Annotation(token, 0, 1, It, {'sentence': '0'}),
  Annotation(token, 3, 5, has, {'sentence': '0'}),
  Annotation(token, 7, 8, an, {'sentence': '0'}),
  Annotation(token, 10, 16, awesome, {'sentence': '0'}),
  Annotation(token, 18, 23, ending, {'sentence': '0'}),
  Annotation(token, 24, 24, ,, {'sentence': '0'}),
  Annotation(token, 26, 26, I, {'sentence': '0'}),
  Annotation(token, 28, 31, wish, {'sentence': '0'}),
  Annotation(token, 33, 33, I, {'sentence': '0'}),
  Annotation(token, 35, 37, had, {'sentence': '0'}),
  Annotation(token, 39, 45, watched, {'sentence': '0'}),
  Annotation(token, 47, 50, that, {'sentence': '0'}),
  Annotation(token, 52, 56, movie, {'sentence': '0'}),
  Annotation(token, 58, 64, earlier, {'sentence': '0'}),
  A

In [None]:
light_result.keys()

dict_keys(['document', 'token', 'pred_class'])

Let's check the prediction:

In [None]:
pd.set_option('display.max_colwidth', None)

text= []
pred= []

# Pair each document annotation with its predicted class annotation.
for doc, cls in zip(light_result["document"], light_result["pred_class"]):
    text.append(doc.result)
    pred.append(cls.result)

result_df= pd.DataFrame({"text": text, "prediction": pred})
result_df.head()

   text                                                                prediction
0  It has an awesome ending, I wish I had watched that movie earlier.  pos
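
The `pred_class` annotation printed earlier also carries per-class scores in its metadata, under keys such as `'Some(neg)'`. A small helper can turn that into a clean label-to-probability mapping. This is a sketch: `class_scores` is a hypothetical helper, and the `Some(...)` key format is taken from the output of this notebook run, so it may differ in other Spark NLP versions.

```python
def class_scores(metadata):
    """Return {label: probability} from a pred_class annotation's metadata dict."""
    scores = {}
    for key, value in metadata.items():
        # Score keys look like 'Some(neg)'; other keys (e.g. 'sentence') are skipped.
        if key.startswith("Some(") and key.endswith(")"):
            label = key[len("Some("):-1]
            scores[label] = float(value)
    return scores

# Metadata copied from the pred_class annotation shown in the fullAnnotate output.
metadata = {'sentence': '0', 'Some(neg)': '0.021220155', 'Some(pos)': '0.97877985'}

scores = class_scores(metadata)
best_label = max(scores, key=scores.get)
print(scores, best_label)
```

In the notebook itself, the same helper could be applied to `light_result["pred_class"][0].metadata` instead of the hard-coded dictionary.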
