![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/Spark_NLP_Udemy_MOOC/Open_Source/20.01.ClassifierDLApproach.ipynb)

# **ClassifierDLApproach**

This notebook will cover the different parameters and usages of `ClassifierDLApproach`. 

`ClassifierDLApproach` annotator provides the ability to train deep learning models that can make text classifications - assign tags or categories to text. 


**📖 Learning Objectives:**

1. Understand how the `ClassifierDLApproach` algorithm works.

2. Understand how `ClassifierDLApproach` follows a supervised approach that builds on features (sentence embeddings) extracted from the text.

3. Become comfortable using the different parameters of the annotator.


**🔗 Helpful Links:**

- Documentation : [ClassifierDL](https://nlp.johnsnowlabs.com/docs/en/annotators#classifierdl)

- Python Docs : [ClassifierDLApproach](https://nlp.johnsnowlabs.com/api/python/reference/autosummary/sparknlp/annotator/classifier_dl/classifier_dl/index.html#sparknlp.annotator.classifier_dl.classifier_dl.ClassifierDLApproach)

- Scala Docs : [ClassifierDLApproach](https://nlp.johnsnowlabs.com/api/com/johnsnowlabs/nlp/annotators/classifier/dl/ClassifierDLApproach.html)

- For extended examples of usage, see the [Spark NLP Workshop repository](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/5.Text_Classification_with_ClassifierDL.ipynb).

- For additional information, see [Text Classification in Spark NLP with Bert and Universal Sentence Encoders](https://towardsdatascience.com/text-classification-in-spark-nlp-with-bert-and-universal-sentence-encoders-e644d618ca32).


ClassifierDL architecture is discussed in the [Mining Adverse Drug Reactions from Unstructured Mediums at Scale](https://arxiv.org/abs/2201.01405) research paper.

According to the researchers, to process large volumes of data, a text classification model needs to be scalable and accurate, because it is used to filter out documents, reviews, and tweets that do not contain any indication of an adverse event. To achieve this, the John Snow Labs team used a Fully Connected Neural Network (FCNN) model that does not require hand-crafted features and relies on a single embedding vector for classification.


Given the conversational nature of social media text, it is possible to utilise the entire document to get efficient embeddings (with little text clipping in case of BioBERT embeddings) that are directly fed to the classifier model. Since there is only a single feature vector as input to the model, multiple embedding techniques were tested to analyse performance.
  

## **📜 Background**


`ClassifierDLApproach` is a generic multi-class text classifier trainer in Spark NLP that accepts various text embeddings as input. The `ClassifierDL` annotator uses a deep neural network (DNN) built in TensorFlow and supports up to 100 classes.

It is possible to train a text classifier model with BERT, ELMo, GloVe and Universal Sentence Encoder embeddings in Spark NLP using the `ClassifierDLApproach` annotator.

The model is trained and optimized for greater-than-word length text, such as sentences, phrases or short paragraphs. It is trained on a variety of data sources and a variety of tasks with the aim of dynamically accommodating a wide variety of natural language understanding tasks. The input is variable length text and the output vector dimension depends on the type of embeddings that were used during training - typically ranges from 100 to 768 dimensions. 




## **🎬 Colab Setup**

In [None]:
! pip install -q pyspark==3.1.2  spark-nlp==4.2.4

In [2]:
import sys
sys.path.append('../../')

import sparknlp

from sparknlp.base import LightPipeline
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.sql.functions import array_contains
from sparknlp.annotator import *
from sparknlp.common import RegexRule
from sparknlp.base import DocumentAssembler, Finisher
import pandas as pd
import pyspark.sql.functions as F

spark = sparknlp.start()

print("Spark NLP version", sparknlp.version())
print("Apache Spark version:", spark.version)

spark

Spark NLP version 4.2.4
Apache Spark version: 3.1.2


## **🖨️ Input/Output Annotation Types**

- Input: `SENTENCE_EMBEDDINGS`

- Output: `CATEGORY`

## **🔎 Parameters**



### **`setBatchSize`**

In deep learning, batch size refers to the number of training examples used in one iteration of gradient descent. During training, the training data is divided into batches, and each batch is fed through the neural network to compute the loss and update the weights. 

`setBatchSize` is a hyperparameter that can be tuned to achieve the best performance. Increasing the batch size can reduce the overall training time, but it may also require more memory and computational resources. Conversely, reducing the batch size may increase the training time but it can reduce the memory usage and allow the model to fit into memory.

The default batch size in `ClassifierDLApproach` is `64`. This value is suitable for small datasets, but larger datasets may benefit from larger batch sizes.
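
As a rough illustration of how batch size relates to the number of weight updates, the sketch below computes the number of batches per epoch. The helper `batches_per_epoch` is a hypothetical name, not part of Spark NLP, and it assumes that the last partial batch is still processed. The results agree with the `batches` values reported in the training logs of this notebook for 20,000 training examples.

```python
import math


def batches_per_epoch(num_examples: int, batch_size: int) -> int:
    """Number of batches needed to see every example once (hypothetical helper)."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    # The last batch may be smaller than batch_size, hence the ceiling.
    return math.ceil(num_examples / batch_size)


default_batches = batches_per_epoch(20000, 64)  # batch size 64
small_batches = batches_per_epoch(20000, 16)    # batch size 16
print(default_batches, small_batches)
```

A smaller batch size therefore means more weight updates per epoch, which is one reason why each epoch takes longer.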

### **`setLr`**

Lr stands for **Learning Rate**. It is a hyperparameter that determines the step size at which the model weights are updated during the optimization process in model training.

During training, the model adjusts its weights to minimize the loss function, which measures the difference between the predicted and actual labels for each input sentence. The learning rate determines the size of the weight update at each iteration. A larger learning rate results in larger weight updates, while a smaller learning rate results in smaller weight updates.

`setLr` parameter can significantly affect the training performance and the convergence speed of the model. A learning rate that is too high can cause the model to overshoot the optimal weights and fail to converge, while a learning rate that is too low can result in slow convergence or getting stuck in a local minimum.

The optimal value for the `setLr` parameter depends on the specific task and the characteristics of the dataset. It is often necessary to experiment with different learning rates to find the optimal value.
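
To make the effect of the learning rate concrete, here is a minimal sketch of plain gradient descent on the toy function `f(w) = w**2`. This is not how `ClassifierDLApproach` is optimized internally; the function, the starting point and the three learning rates are assumptions chosen only for illustration.

```python
def gradient_descent(lr: float, steps: int = 20, w0: float = 1.0) -> float:
    """Minimize f(w) = w**2 with a fixed learning rate and return the final w."""
    w = w0
    for _ in range(steps):
        grad = 2 * w      # derivative of w**2
        w -= lr * grad    # weight update scaled by the learning rate
    return w


w_good = gradient_descent(lr=0.1)       # moves steadily towards the minimum at 0
w_too_low = gradient_descent(lr=0.001)  # barely moves in 20 steps
w_too_high = gradient_descent(lr=1.1)   # overshoots and diverges
print(w_good, w_too_low, w_too_high)
```

With a moderate rate the weight approaches the minimum, with a very small rate it hardly changes within the same number of steps, and with a rate that is too large every update overshoots and the weight grows in magnitude.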

### **`setThreshold`**

`setThreshold` is used to set the classification probability threshold for a binary classifier.

In text classification, a model is trained to classify input data into one of several possible classes. After training, the model will predict a probability distribution over the classes for each input data point, where the predicted probability for each class represents the model's confidence that the input data point belongs to that class.

`setThreshold` is used to set a threshold value for the predicted probabilities. Predicted probabilities below this threshold will be considered as belonging to the negative class, while any probabilities above the threshold will be considered as belonging to the positive class. This threshold is used to convert the probability distribution output by the model into a binary classification result.


### **`setMaxEpochs`**

`setMaxEpochs` specifies the maximum number of training epochs to run, after which training stops regardless of whether the model has converged.

An epoch is a complete iteration over the training data, which means that the model sees every example in the training data exactly once during that epoch. 

During training, the model parameters are updated based on the error between the predicted output and the true output for each example in the training data. As the number of epochs increases, the model has more opportunities to adjust its parameters and reduce its error on the training data.

Setting a larger number of epochs may improve the performance of the model, but it can also increase the training time and the risk of overfitting the model to the training data. Conversely, setting a smaller number of epochs may result in a faster training time but a suboptimal model performance.

### **`setVerbose`**

`setVerbose` sets the level of verbosity during training, that is, how much information is reported about the training process. This information can include metrics such as loss and accuracy that are calculated on the training data during each epoch.

The parameter takes an integer verbosity level rather than a boolean. The level determines how detailed the training output is, ranging from per-epoch metrics such as loss, accuracy and training time to a brief summary; see the API documentation linked above for the meaning of each level.

### **`setEnableOutputLogs`**

`setEnableOutputLogs` enables or disables the logging of training information.

When `setEnableOutputLogs` is `True`, the model logs information about the training progress, such as the epoch number, the loss value, and the accuracy for each epoch.

This information can be useful for monitoring the training process and evaluating the performance of the model.


### **`setOutputLogsPath`**

`setOutputLogsPath` lets you specify the directory where the model writes its logs during training.

When `setEnableOutputLogs(True)` and `setOutputLogsPath("path/to/logs")` are both set, the training progress is written to files in the specified directory.

This can be useful for keeping track of the training progress of multiple models, comparing the performance of different models, and identifying potential issues or errors during the training process.

### **`setLabelColumn`**

`setLabelColumn` lets you specify the name of the column in the input DataFrame that contains the labels.

When `setLabelColumn` is called with the name of a valid column in the input DataFrame, the ClassifierDL will use the values in that column as the labels for the training data. 

### **`setRandomSeed`**

`setRandomSeed` lets you specify the random seed used by the model during training.

When you train a model, the initialization of the model's weights and biases can have a significant impact on the model's performance. To ensure that your model's initialization is reproducible, you can set a random seed that is used by the model during training. 

By setting a random seed, you can ensure that your model is initialized in the same way each time you train it, which can help you to reproduce your results and debug any issues that you may encounter.
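
The general idea behind a random seed can be sketched with Python's standard `random` module. This is only an analogy: `init_weights` is a hypothetical helper and does not reproduce how Spark NLP or TensorFlow initialize the network.

```python
import random


def init_weights(n: int, seed: int) -> list:
    """Draw n small random weights from a generator seeded with `seed`."""
    rng = random.Random(seed)  # independent generator, does not touch global state
    return [rng.gauss(0.0, 0.1) for _ in range(n)]


run_a = init_weights(5, seed=42)
run_b = init_weights(5, seed=42)  # same seed, same weights
run_c = init_weights(5, seed=7)   # different seed, different weights
print(run_a == run_b, run_a == run_c)
```

Two runs with the same seed start from identical values, which is what makes results comparable across training runs.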

### **`setTestDataset`**

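`setTestDataset` sets the path to a test dataset, stored as a Parquet file, that is not used for training. When it is set, the model computes evaluation statistics on this dataset during training. The test data must already contain the same embeddings column as the training data, so it has to be processed by the same embedding stages before it is saved.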

It is important to evaluate the model on a separate test dataset to ensure that it is able to generalize well to new, unseen data.

### **`setValidationSplit`**

`setValidationSplit` lets you specify the fraction of the training dataset to hold out for validation during model training.

The dataset is split into a training set and a validation set: the model is trained on the training set and evaluated on the validation set after each epoch.
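
As a small worked example of what a validation fraction means in terms of row counts, the sketch below splits 20,000 rows with a fraction of 0.2. The helper `split_sizes` and the value 0.2 are assumptions for illustration; how Spark NLP selects the validation rows internally is not shown here.

```python
def split_sizes(num_rows: int, validation_split: float) -> tuple:
    """Return (training rows, validation rows) for a given validation fraction."""
    if not 0.0 <= validation_split < 1.0:
        raise ValueError("validation_split must be in [0, 1)")
    num_validation = round(num_rows * validation_split)
    return num_rows - num_validation, num_validation


train_rows, validation_rows = split_sizes(20000, 0.2)
print(train_rows, validation_rows)
```

A larger fraction gives a more reliable validation estimate but leaves fewer rows for training.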

### **`setEvaluationLogExtended`**

`setEvaluationLogExtended` enables or disables extended evaluation log during training.

During training, the performance of the model is evaluated on a validation set after each epoch. By default, the evaluation log contains only basic information such as precision, recall, and F1-score. 

If `setEvaluationLogExtended` is set to `True`, the evaluation log also reports the time taken and the precision, recall, and F1-score of each label.

### **`setConfigProtoBytes`**

`setConfigProtoBytes` allows users to set the TensorFlow configuration proto bytes for the underlying neural network used in the model.

The TensorFlow configuration proto bytes specify the configuration settings for the TensorFlow runtime, which can include options such as memory allocation, CPU/GPU usage, and other performance-related settings. 

### **`setDropout`**

`setDropout` controls the rate at which randomly selected neurons in the neural network are "dropped out" during training. Dropout is a regularization technique used in deep learning to prevent overfitting.

A higher dropout rate drops more neurons and regularizes the model more strongly. If the rate is too high, the model may underfit and its accuracy can drop.
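
The mechanism can be sketched in a few lines of plain Python. This is the common "inverted dropout" formulation, in which the surviving activations are rescaled so that their expected value stays the same; whether `ClassifierDLApproach` uses exactly this variant is an assumption, and `apply_dropout` is a hypothetical helper.

```python
import random


def apply_dropout(activations, rate, rng):
    """Zero each activation with probability `rate` and rescale the survivors."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    keep_prob = 1.0 - rate
    # A kept activation is divided by keep_prob so the expected value is unchanged.
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]


rng = random.Random(0)
activations = [1.0] * 10
dropped = apply_dropout(activations, rate=0.5, rng=rng)    # values are 0.0 or 2.0
unchanged = apply_dropout(activations, rate=0.0, rng=rng)  # nothing is dropped
print(dropped)
```

Dropout is applied only during training; at prediction time all neurons are used.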


## **💻 Train a Model with ClassifierDLApproach**


`ClassifierDLApproach` is used for training a text classification model.

The first step is to load the **training** and **test** datasets.

In [3]:
! wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Public/data/news_category_train.csv
! wget -q https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Public/data/news_category_test.csv

In [4]:
trainDataset_full = spark.read \
      .option("header", True) \
      .csv("news_category_train.csv")

trainDataset_full.show(10, truncate=130)

+--------+----------------------------------------------------------------------------------------------------------------------------------+
|category|                                                                                                                       description|
+--------+----------------------------------------------------------------------------------------------------------------------------------+
|Business|                                              Short sellers, Wall Street's dwindling band of ultra cynics, are seeing green again.|
|Business| Private investment firm Carlyle Group, which has a reputation for making well timed and occasionally controversial plays in th...|
|Business| Soaring crude prices plus worries about the economy and the outlook for earnings are expected to hang over the stock market ne...|
|Business| Authorities have halted oil export flows from the main pipeline in southern Iraq after intelligence showed a rebel militia cou...|
|Busin

In [None]:
trainDataset_full.groupBy('category').count().show()

+--------+-----+
|category|count|
+--------+-----+
|   World|30000|
|Sci/Tech|30000|
|  Sports|30000|
|Business|30000|
+--------+-----+



There are a total of **120,000** labelled rows available for training the model.

To save time, we will take a subset of the dataset (20,000 rows) and use it to train the models.
<br/>

In [5]:
trainDataset = trainDataset_full.limit(20000)

In [6]:
trainDataset.groupBy('category').count().show()

+--------+-----+
|category|count|
+--------+-----+
|Business| 4665|
|Sci/Tech| 5126|
|  Sports| 4982|
|   World| 5227|
+--------+-----+



Now, we have a **Training Dataset** of **20,000** rows and the distribution is reasonable, although not perfect.
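
To quantify how balanced the subset is, the class counts from the table above can be converted to percentages. The sketch below uses only those counts; the variable names are illustrative.

```python
# Class counts copied from the groupBy output for the 20,000-row subset.
counts = {"Business": 4665, "Sci/Tech": 5126, "Sports": 4982, "World": 5227}

total = sum(counts.values())
# Share of each class in percent, rounded to one decimal place.
shares = {label: round(100 * n / total, 1) for label, n in counts.items()}

for label, share in shares.items():
    print(f"{label:<9} {share:>5}%")
```

Every class accounts for roughly a quarter of the rows, so no class dominates the training data.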

<br/>

In [7]:
testDataset = spark.read \
      .option("header", True) \
      .csv("news_category_test.csv")

testDataset.show(10, truncate=130)

+--------+----------------------------------------------------------------------------------------------------------------------------------+
|category|                                                                                                                       description|
+--------+----------------------------------------------------------------------------------------------------------------------------------+
|Business|   Unions representing workers at Turner   Newall say they are 'disappointed' after talks with stricken parent firm Federal Mogul.|
|Sci/Tech| TORONTO, Canada    A second team of rocketeers competing for the  #36;10 million Ansari X Prize, a contest for privately funde...|
|Sci/Tech| A company founded by a chemistry researcher at the University of Louisville won a grant to develop a method of producing bette...|
|Sci/Tech| It's barely dawn when Mike Fitzpatrick starts his shift with a blur of colorful maps, figures and endless charts, but already ...|
|Sci/T

In [8]:
testDataset.groupBy('category').count().show()

+--------+-----+
|category|count|
+--------+-----+
|   World| 1900|
|Sci/Tech| 1900|
|  Sports| 1900|
|Business| 1900|
+--------+-----+



**7,600** labelled texts will be used to test the performance of the trained model.

<br/>

### **Pipeline Stages**

Text preprocessing has to be performed to prepare the text data for model training. It is the first step of most NLP projects, and this pipeline uses the following preprocessing annotators:



- [Tokenizer](https://nlp.johnsnowlabs.com/docs/en/annotators#tokenizer): breaks a text down into smaller units called tokens, such as words, phrases, symbols, and other elements.
- [Normalizer](https://nlp.johnsnowlabs.com/docs/en/annotators#normalizer): removes unwanted characters from tokens following a regex pattern and transforms words based on a provided dictionary.
- [StopWordsCleaner](https://nlp.johnsnowlabs.com/docs/en/annotators#stopwordscleaner): removes commonly used words such as "the", "a", "an", and "is" from a text, since they carry little information for classification.
- [Lemmatizer](https://nlp.johnsnowlabs.com/docs/en/annotators#lemmatizer): reduces words to their base form, or lemma, so that different inflections of a word are treated as the same token.


In [None]:
document_assembler = DocumentAssembler() \
            .setInputCol("description") \
            .setOutputCol("document")
    
tokenizer = Tokenizer() \
            .setInputCols(["document"]) \
            .setOutputCol("token")
          
normalizer = Normalizer() \
            .setInputCols(["token"]) \
            .setOutputCol("normalized")

stopwords_cleaner = StopWordsCleaner()\
            .setInputCols("normalized")\
            .setOutputCol("cleanTokens")\
            .setCaseSensitive(False)

lemma = LemmatizerModel.pretrained('lemma_antbnc') \
            .setInputCols(["cleanTokens"]) \
            .setOutputCol("lemma")

lemma_antbnc download started this may take some time.
Approximate size to download 907.6 KB
[OK!]


### **Sentence Embeddings**

**Sentence Embeddings** are used for text classification because they capture the semantic meaning of a sentence, allowing for more accurate classification of text.

The [SentenceEmbeddings](https://nlp.johnsnowlabs.com/docs/en/annotators#sentenceembeddings) annotator converts the results from WordEmbeddings by either summing up or averaging (`.setPoolingStrategy` was set to **AVERAGE** in this case) all the word embeddings in a sentence or a document.
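
The two pooling strategies can be illustrated with a minimal pure-Python sketch. The `pool` function and the two-dimensional toy vectors are assumptions for illustration only; the real annotator operates on the pretrained word embedding vectors.

```python
def pool(word_vectors, strategy="AVERAGE"):
    """Combine word vectors into one sentence vector by summing or averaging."""
    if not word_vectors:
        raise ValueError("need at least one word vector")
    dim = len(word_vectors[0])
    # Element-wise sum over all word vectors.
    summed = [sum(vec[i] for vec in word_vectors) for i in range(dim)]
    if strategy == "SUM":
        return summed
    if strategy == "AVERAGE":
        return [s / len(word_vectors) for s in summed]
    raise ValueError(f"unknown strategy: {strategy}")


toy_word_vectors = [[1.0, 2.0], [3.0, 4.0]]  # two words, two dimensions
averaged = pool(toy_word_vectors, "AVERAGE")
summed = pool(toy_word_vectors, "SUM")
print(averaged, summed)
```

Averaging keeps the sentence vector on the same scale regardless of sentence length, whereas summing produces larger vectors for longer texts.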

In [None]:
# Default pretrained word embeddings (the download log below shows `glove_100d`)
glove_embeddings = WordEmbeddingsModel.pretrained() \
                        .setInputCols(["document", "lemma"])\
                        .setOutputCol("embeddings")\
                        .setCaseSensitive(False)

embeddingsSentence = SentenceEmbeddings() \
                        .setInputCols(["document", "embeddings"]) \
                        .setOutputCol("sentence_embeddings") \
                        .setPoolingStrategy("AVERAGE")

classsifierdl = ClassifierDLApproach()\
                        .setInputCols(["sentence_embeddings"])\
                        .setOutputCol("class")\
                        .setLabelColumn("category")\
                        .setMaxEpochs(10)\
                        .setEnableOutputLogs(True)
                      
clf_pipeline = Pipeline(
    stages=[
        document_assembler, 
        tokenizer,
        normalizer,
        stopwords_cleaner, 
        lemma, 
        glove_embeddings,
        embeddingsSentence,
        classsifierdl])

glove_100d download started this may take some time.
Approximate size to download 145.3 MB
[OK!]


### **Fit the Model**

The next step is to fit the pipeline on the training dataset to train the model:

In [None]:
%%time
clf_pipelineModel = clf_pipeline.fit(trainDataset)

CPU times: user 321 ms, sys: 43.8 ms, total: 365 ms
Wall time: 55.1 s


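To get **information about the performance of the model** during training, including its accuracy and loss, we read the log file produced by `setEnableOutputLogs(True)` (here under `/root/annotator_logs`) and observe how those metrics change from epoch to epoch: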

In [None]:
import os
log_file_name = os.listdir("/root/annotator_logs")[0]

with open("/root/annotator_logs/"+log_file_name, "r") as log_file :
    print(log_file.read())

Training started - epochs: 10 - learning_rate: 0.005 - batch_size: 64 - training_examples: 20000 - classes: 4
Epoch 0/10 - 2.69s - loss: 280.25433 - acc: 0.839994 - batches: 313
Epoch 1/10 - 2.35s - loss: 267.71057 - acc: 0.86653644 - batches: 313
Epoch 2/10 - 2.36s - loss: 267.09302 - acc: 0.87394834 - batches: 313
Epoch 3/10 - 2.34s - loss: 266.17636 - acc: 0.8776042 - batches: 313
Epoch 4/10 - 2.46s - loss: 266.2502 - acc: 0.8795573 - batches: 313
Epoch 5/10 - 2.39s - loss: 266.58392 - acc: 0.8816106 - batches: 313
Epoch 6/10 - 2.31s - loss: 266.3976 - acc: 0.88316303 - batches: 313
Epoch 7/10 - 2.34s - loss: 265.8325 - acc: 0.8846655 - batches: 313
Epoch 8/10 - 2.33s - loss: 265.06653 - acc: 0.8868189 - batches: 313
Epoch 9/10 - 2.27s - loss: 264.38763 - acc: 0.8883714 - batches: 313



👍 With the default learning rate and batch size, the model reached a final training **accuracy** of **~ 89 %** after 10 epochs. 
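
The per-epoch metrics in a log like the one above can also be extracted programmatically, for example to plot them or to compare runs. The sketch below is a minimal parser; the regular expression is an assumption derived from the log lines shown in this notebook, and `parse_epoch_lines` is a hypothetical helper rather than a Spark NLP function.

```python
import re

# Pattern inferred from lines such as:
# "Epoch 0/10 - 2.69s - loss: 280.25433 - acc: 0.839994 - batches: 313"
EPOCH_LINE = re.compile(
    r"Epoch (\d+)/(\d+) - ([\d.]+)s - loss: ([\d.]+) - acc: ([\d.]+) - batches: (\d+)"
)


def parse_epoch_lines(log_text: str) -> list:
    """Return one dict per epoch line found in the log text."""
    rows = []
    for match in EPOCH_LINE.finditer(log_text):
        rows.append({
            "epoch": int(match.group(1)),
            "seconds": float(match.group(3)),
            "loss": float(match.group(4)),
            "acc": float(match.group(5)),
        })
    return rows


sample_log = """Training started - epochs: 10 - learning_rate: 0.005 - batch_size: 64 - training_examples: 20000 - classes: 4
Epoch 0/10 - 2.69s - loss: 280.25433 - acc: 0.839994 - batches: 313
Epoch 9/10 - 2.27s - loss: 264.38763 - acc: 0.8883714 - batches: 313"""

rows = parse_epoch_lines(sample_log)
for row in rows:
    print(row["epoch"], row["loss"], row["acc"])
```

In the notebook, `sample_log` could be replaced with the text read from the log file in the cell above.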

### **♻️ Transform & Get Predictions**

Now, in order to check the performance of the trained model, we will get predictions on the test dataset and then observe the predictions vs. ground truths.

In [None]:
preds = clf_pipelineModel.transform(testDataset)

In [None]:
preds.select('category','description',"class.result").show(10, truncate=120)

+--------+------------------------------------------------------------------------------------------------------------------------+----------+
|category|                                                                                                             description|    result|
+--------+------------------------------------------------------------------------------------------------------------------------+----------+
|Business|Unions representing workers at Turner   Newall say they are 'disappointed' after talks with stricken parent firm Fede...|[Business]|
|Sci/Tech| TORONTO, Canada    A second team of rocketeers competing for the  #36;10 million Ansari X Prize, a contest for priva...|[Sci/Tech]|
|Sci/Tech| A company founded by a chemistry researcher at the University of Louisville won a grant to develop a method of produ...|[Sci/Tech]|
|Sci/Tech| It's barely dawn when Mike Fitzpatrick starts his shift with a blur of colorful maps, figures and endless charts, bu...|[Sci/Tech]|

### **Test Set Evaluation**

The **Classification Report** below shows the quality of predictions of the trained model. 

In [None]:
from sklearn.metrics import classification_report

preds_df = preds.select('category','description',"class.result").toPandas()

preds_df['result'] = preds_df['result'].apply(lambda x : x[0])

print (classification_report(preds_df['category'], preds_df['result']))

              precision    recall  f1-score   support

    Business       0.83      0.80      0.81      1900
    Sci/Tech       0.80      0.87      0.83      1900
      Sports       0.94      0.96      0.95      1900
       World       0.90      0.85      0.88      1900

    accuracy                           0.87      7600
   macro avg       0.87      0.87      0.87      7600
weighted avg       0.87      0.87      0.87      7600
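
For reference, the F1-score in the report is the harmonic mean of precision and recall. The sketch below recomputes it for two classes from the rounded values printed above; because the inputs are already rounded to two decimals, recomputed values can differ from the report in the last digit for other classes.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


business_f1 = f1_score(0.83, 0.80)  # precision and recall of the Business class
sports_f1 = f1_score(0.94, 0.96)    # precision and recall of the Sports class
print(round(business_f1, 2), round(sports_f1, 2))
```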



In [None]:
classsifierdl.extractParamMap()

{Param(parent='ClassifierDLApproach_085f6c7c0fb1', name='lazyAnnotator', doc='Whether this AnnotatorModel acts as lazy in RecursivePipelines'): False,
 Param(parent='ClassifierDLApproach_085f6c7c0fb1', name='maxEpochs', doc='Maximum number of epochs to train'): 10,
 Param(parent='ClassifierDLApproach_085f6c7c0fb1', name='lr', doc='Learning Rate'): 0.005,
 Param(parent='ClassifierDLApproach_085f6c7c0fb1', name='batchSize', doc='Batch size'): 64,
 Param(parent='ClassifierDLApproach_085f6c7c0fb1', name='dropout', doc='Dropout coefficient'): 0.5,
 Param(parent='ClassifierDLApproach_085f6c7c0fb1', name='enableOutputLogs', doc='Whether to use stdout in addition to Spark logs.'): True,
 Param(parent='ClassifierDLApproach_085f6c7c0fb1', name='evaluationLogExtended', doc='Whether logs for validation to be extended: it displays time and evaluation of each label. Default is False.'): False,
 Param(parent='ClassifierDLApproach_085f6c7c0fb1', name='inputCols', doc='previous annotations columns, if 

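Those are the **(hyper)parameters** of the `ClassifierDLApproach` annotator together with their current values, which are the defaults except where we set them explicitly (for example `maxEpochs` and `enableOutputLogs`).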

<br/>

Changing those values will have an impact on the accuracy of the **model** and the **time** needed to train a model.

<br/>

## **💻 Retrain the Model with Different Parameters - 1**

Training a deep learning model involves finding the set of weights and biases that minimize the loss function, which measures how well the model predicts the correct output for the input data. In practice, this can be a complex and computationally intensive process that requires many iterations.

When we retrain a deep learning model with new hyperparameters, we restart this process from scratch with freshly initialized weights and biases. For example, we might adjust the learning rate, regularization parameters, or other hyperparameters of the model to improve its performance on a specific task or dataset.


Instead of the default values, we will train the model with different parameter values to see their effect on the accuracy, loss, and training time.

In this case, `setMaxEpochs` was increased to **20** and `setBatchSize` was decreased to **16**.


- `setMaxEpochs`: 20
- `setLr`: 0.005
- `setThreshold`: 0.5
- `setDropout`: 0.5
- `setBatchSize`: 16

**Number of Epochs** is an important parameter in deep learning because each epoch represents one complete pass over the training data. Each epoch consists of one or more batches of data, and the model is updated on each batch until the entire dataset has been seen. More epochs give the model more opportunities to learn from the data and improve its accuracy.

<br/>

**Batch Size** determines the number of samples that are processed in one weight update. A larger batch size means fewer updates per epoch and usually shorter training times, but very large batches can generalize worse. A smaller batch size means more, noisier updates per epoch and longer training times, which can sometimes lead to better generalization.

<br/>

In [None]:
classsifierdl = ClassifierDLApproach()\
                        .setInputCols(["sentence_embeddings"])\
                        .setOutputCol("class")\
                        .setLabelColumn("category")\
                        .setMaxEpochs(20)\
                        .setBatchSize(16)\
                        .setEnableOutputLogs(True)
                      
clf_pipeline = Pipeline(
    stages=[
        document_assembler, 
        tokenizer,
        normalizer,
        stopwords_cleaner, 
        lemma, 
        glove_embeddings,
        embeddingsSentence,
        classsifierdl])

In [None]:
! rm -r /root/annotator_logs

In [None]:
%%time
clf_pipelineModel = clf_pipeline.fit(trainDataset)

CPU times: user 669 ms, sys: 75.4 ms, total: 744 ms
Wall time: 2min 9s


In [None]:
import os
log_file_name = os.listdir("/root/annotator_logs")[0]

with open("/root/annotator_logs/"+log_file_name, "r") as log_file :
    print(log_file.read())

Training started - epochs: 20 - learning_rate: 0.005 - batch_size: 16 - training_examples: 20000 - classes: 4
Epoch 0/20 - 5.18s - loss: 1149.0548 - acc: 0.8361 - batches: 1250
Epoch 1/20 - 4.99s - loss: 1109.0306 - acc: 0.8641 - batches: 1250
Epoch 2/20 - 5.06s - loss: 1104.7867 - acc: 0.8707 - batches: 1250
Epoch 3/20 - 4.90s - loss: 1095.4418 - acc: 0.87535 - batches: 1250
Epoch 4/20 - 4.85s - loss: 1087.351 - acc: 0.8792 - batches: 1250
Epoch 5/20 - 5.00s - loss: 1086.1289 - acc: 0.88185 - batches: 1250
Epoch 6/20 - 4.94s - loss: 1084.6163 - acc: 0.8839 - batches: 1250
Epoch 7/20 - 5.05s - loss: 1082.5717 - acc: 0.88675 - batches: 1250
Epoch 8/20 - 5.12s - loss: 1080.0668 - acc: 0.88825 - batches: 1250
Epoch 9/20 - 5.06s - loss: 1077.6245 - acc: 0.89045 - batches: 1250
Epoch 10/20 - 5.04s - loss: 1075.7123 - acc: 0.8919 - batches: 1250
Epoch 11/20 - 4.95s - loss: 1073.9911 - acc: 0.89315 - batches: 1250
Epoch 12/20 - 4.92s - loss: 1072.4154 - acc: 0.89415 - batches: 1250
Epoch 13/2

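👍 Using the new settings, the model reached a final training **accuracy** of **~ 90 %** after 20 epochs, at the cost of a longer training time (2min 9s instead of 55.1 s wall time).

The first configuration (10 epochs, batch size 64) produced a training **accuracy** of **~ 89 %**.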

<br/>

In [None]:
preds = clf_pipelineModel.transform(testDataset)

In [None]:
from sklearn.metrics import classification_report

preds_df = preds.select('category','description',"class.result").toPandas()

preds_df['result'] = preds_df['result'].apply(lambda x : x[0])

print (classification_report(preds_df['category'], preds_df['result']))

              precision    recall  f1-score   support

    Business       0.83      0.79      0.81      1900
    Sci/Tech       0.78      0.89      0.83      1900
      Sports       0.95      0.95      0.95      1900
       World       0.92      0.83      0.88      1900

    accuracy                           0.87      7600
   macro avg       0.87      0.87      0.87      7600
weighted avg       0.87      0.87      0.87      7600



## **💻 Retrain the Model with Different Parameters - 2**

In this training, `setMaxEpochs` and `setBatchSize` were kept at their previous values, but `setLr` was set to **0.001**.


- `setMaxEpochs`: 20
- `setLr`: 0.001
- `setThreshold`: 0.5
- `setDropout`: 0.5
- `setBatchSize`: 16

**Learning rate** is important in deep learning because it determines how quickly the model learns from the data. 

A higher learning rate can lead to faster convergence, but if it is too high the model can overshoot the optimal weights and fail to converge. A lower learning rate gives more stable updates, but convergence is slower and the model may need more epochs to reach the same accuracy.

<br/>

In [None]:
classsifierdl = ClassifierDLApproach()\
                        .setInputCols(["sentence_embeddings"])\
                        .setOutputCol("class")\
                        .setLabelColumn("category")\
                        .setMaxEpochs(20)\
                        .setBatchSize(16)\
                        .setLr(0.001)\
                        .setEnableOutputLogs(True)
                      
clf_pipeline = Pipeline(
    stages=[
        document_assembler, 
        tokenizer,
        normalizer,
        stopwords_cleaner, 
        lemma, 
        glove_embeddings,
        embeddingsSentence,
        classsifierdl])

In [None]:
! rm -r /root/annotator_logs

In [None]:
%%time
clf_pipelineModel = clf_pipeline.fit(trainDataset)

CPU times: user 637 ms, sys: 87.7 ms, total: 725 ms
Wall time: 2min 6s


In [None]:
import os
log_file_name = os.listdir("/root/annotator_logs")[0]

with open("/root/annotator_logs/"+log_file_name, "r") as log_file :
    print(log_file.read())

Training started - epochs: 20 - learning_rate: 0.001 - batch_size: 16 - training_examples: 20000 - classes: 4
Epoch 0/20 - 5.05s - loss: 1130.0118 - acc: 0.8423 - batches: 1250
Epoch 1/20 - 4.84s - loss: 1098.1505 - acc: 0.86365 - batches: 1250
Epoch 2/20 - 4.81s - loss: 1092.0223 - acc: 0.86915 - batches: 1250
Epoch 3/20 - 4.93s - loss: 1088.7266 - acc: 0.87165 - batches: 1250
Epoch 4/20 - 4.85s - loss: 1086.5955 - acc: 0.8738 - batches: 1250
Epoch 5/20 - 4.80s - loss: 1084.9662 - acc: 0.87565 - batches: 1250
Epoch 6/20 - 5.04s - loss: 1083.6718 - acc: 0.8762 - batches: 1250
Epoch 7/20 - 4.87s - loss: 1082.6428 - acc: 0.8771 - batches: 1250
Epoch 8/20 - 4.81s - loss: 1081.8224 - acc: 0.8777 - batches: 1250
Epoch 9/20 - 4.82s - loss: 1081.1245 - acc: 0.8781 - batches: 1250
Epoch 10/20 - 4.87s - loss: 1080.5321 - acc: 0.8788 - batches: 1250
Epoch 11/20 - 4.69s - loss: 1080.0076 - acc: 0.87955 - batches: 1250
Epoch 12/20 - 4.75s - loss: 1079.5336 - acc: 0.8805 - batches: 1250
Epoch 13/20

👎 Decreasing the learning rate to **0.001** with `setLr` (the default value is 0.005) lowered the training accuracy to **~ 88.3 %**, which was expected.

In [None]:
preds = clf_pipelineModel.transform(testDataset)

In [None]:
from sklearn.metrics import classification_report

preds_df = preds.select('category','description',"class.result").toPandas()

preds_df['result'] = preds_df['result'].apply(lambda x : x[0])

print (classification_report(preds_df['category'], preds_df['result']))

              precision    recall  f1-score   support

    Business       0.84      0.78      0.81      1900
    Sci/Tech       0.79      0.87      0.83      1900
      Sports       0.94      0.96      0.95      1900
       World       0.90      0.86      0.88      1900

    accuracy                           0.87      7600
   macro avg       0.87      0.87      0.87      7600
weighted avg       0.87      0.87      0.87      7600
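
As a quick sanity check on how to read the report above, the F1 column can be recomputed from the precision and recall columns. This is only a sketch: the inputs are the rounded values printed in the report, so the results match the printed F1 scores only up to rounding.

```python
# Precision and recall per class, copied from the classification report above.
report = {
    "Business": (0.84, 0.78),
    "Sci/Tech": (0.79, 0.87),
    "Sports": (0.94, 0.96),
    "World": (0.90, 0.86),
}

def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

f1_per_class = {label: f1_score(p, r) for label, (p, r) in report.items()}

# Macro average: unweighted mean over the classes.
macro_f1 = sum(f1_per_class.values()) / len(f1_per_class)

for label, f1 in f1_per_class.items():
    print(f"{label:>10}: f1 = {f1:.2f}")
print(f"{'macro avg':>10}: f1 = {macro_f1:.2f}")
```

Because every class has the same support (1900 examples), the macro and weighted averages coincide in this report.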



## **💻 Retrain the Model with Bert Embeddings**

**BERT** (Bidirectional Encoder Representations from Transformers) provides dense vector representations for natural language by using a deep, pre-trained neural network with the Transformer architecture.

Unlike earlier language representation models, BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers.


Reference Academic Paper: [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805)

<br/>

Notice that the `SentenceEmbeddings` annotator is placed after `BertEmbeddings`: `ClassifierDLApproach` takes sentence embeddings as input, so the token-level BERT vectors are pooled into a single vector that represents the whole text.
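
The pipeline below sets the pooling strategy to `AVERAGE`. Conceptually, average pooling takes the element-wise mean of the token vectors. The following is a minimal pure-Python sketch of that idea, not the Spark NLP implementation; the helper `average_pool` and the 4-dimensional vectors are hypothetical.

```python
def average_pool(token_vectors):
    """Average equal-length token vectors into one sentence vector."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in token_vectors) / len(token_vectors)
            for i in range(dim)]

# Three hypothetical 4-dimensional token embeddings for one document.
tokens = [
    [1.0, 0.0, 2.0, 4.0],
    [3.0, 2.0, 0.0, 4.0],
    [2.0, 4.0, 1.0, 4.0],
]
sentence_vector = average_pool(tokens)
print(sentence_vector)  # [2.0, 2.0, 1.0, 4.0]
```

However many tokens a text contains, the result has the same dimension as a single token vector, which is what lets the classifier work with a fixed-size input.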

In [None]:
document_assembler = DocumentAssembler() \
                .setInputCol("description") \
                .setOutputCol("document")

tokenizer = Tokenizer() \
                .setInputCols(["document"]) \
                .setOutputCol("token")
      
bert_embeddings = BertEmbeddings.pretrained(name='small_bert_L4_256', lang='en') \
                .setInputCols(["document",'token'])\
                .setOutputCol("embeddings")

embeddingsSentence = SentenceEmbeddings() \
                .setInputCols(["document", "embeddings"]) \
                .setOutputCol("sentence_embeddings") \
                .setPoolingStrategy("AVERAGE")

classsifierdl = ClassifierDLApproach()\
                .setInputCols(["sentence_embeddings"])\
                .setOutputCol("class")\
                .setLabelColumn("category")\
                .setMaxEpochs(10)\
                .setLr(0.001)\
                .setBatchSize(8)\
                .setEnableOutputLogs(True)
                #.setOutputLogsPath('logs')

bert_clf_pipeline = Pipeline(stages=[document_assembler,
                                     tokenizer,
                                     bert_embeddings,
                                     embeddingsSentence,
                                     classsifierdl])

small_bert_L4_256 download started this may take some time.
Approximate size to download 40.5 MB
[OK!]


In [None]:
! rm -r /root/annotator_logs

In [None]:
%%time
bert_clf_pipelineModel = bert_clf_pipeline.fit(trainDataset)

CPU times: user 1.9 s, sys: 189 ms, total: 2.08 s
Wall time: 6min 34s


In [None]:
log_files = os.listdir("/root/annotator_logs")
log_files

['ClassifierDLApproach_872276d23607.log']

In [None]:
log_file_name = os.listdir("/root/annotator_logs")[0]

with open("/root/annotator_logs/"+log_file_name, "r") as log_file :
    print(log_file.read())

Training started - epochs: 10 - learning_rate: 0.001 - batch_size: 8 - training_examples: 20000 - classes: 4
Epoch 0/10 - 8.79s - loss: 2284.2507 - acc: 0.8308 - batches: 2500
Epoch 1/10 - 8.57s - loss: 2220.583 - acc: 0.85855 - batches: 2500
Epoch 2/10 - 8.52s - loss: 2202.462 - acc: 0.86465 - batches: 2500
Epoch 3/10 - 8.53s - loss: 2191.5354 - acc: 0.8691 - batches: 2500
Epoch 4/10 - 8.56s - loss: 2183.644 - acc: 0.87325 - batches: 2500
Epoch 5/10 - 8.52s - loss: 2178.5251 - acc: 0.87545 - batches: 2500
Epoch 6/10 - 8.41s - loss: 2174.235 - acc: 0.8778 - batches: 2500
Epoch 7/10 - 8.62s - loss: 2170.365 - acc: 0.87995 - batches: 2500
Epoch 8/10 - 8.52s - loss: 2167.0063 - acc: 0.8816 - batches: 2500
Epoch 9/10 - 8.50s - loss: 2164.0303 - acc: 0.8839 - batches: 2500



To save time, we used the [Smaller BERT embeddings](https://nlp.johnsnowlabs.com/2020/08/25/small_bert_L4_256.html), which produce 256-dimensional vectors. 

Larger, higher-dimensional BERT models can also be tried, at the cost of a substantial **increase in training time**.

<br/>

In [None]:
from sklearn.metrics import classification_report

preds = bert_clf_pipelineModel.transform(testDataset)

preds_df = preds.select('category','description',"class.result").toPandas()

preds_df['result'] = preds_df['result'].apply(lambda x : x[0])

print (classification_report(preds_df['category'], preds_df['result']))

              precision    recall  f1-score   support

    Business       0.81      0.79      0.80      1900
    Sci/Tech       0.80      0.85      0.83      1900
      Sports       0.94      0.93      0.93      1900
       World       0.88      0.85      0.87      1900

    accuracy                           0.86      7600
   macro avg       0.86      0.86      0.86      7600
weighted avg       0.86      0.86      0.86      7600



## **💻 Retrain the Model with BertSentenceEmbeddings**

`BertSentenceEmbeddings` uses BERT (Bidirectional Encoder Representations from Transformers) to generate a single numerical vector for each input text that represents its meaning. These vectors can be passed directly to `ClassifierDLApproach`, so the pipeline below needs no separate `Tokenizer` or `SentenceEmbeddings` stage.

- Documentation : [Bert Sentence Embeddings](https://nlp.johnsnowlabs.com/docs/en/transformers#bertsentenceembeddings)



In [9]:
document = DocumentAssembler()\
              .setInputCol("description")\
              .setOutputCol("document")
    
bert_cmlm = BertSentenceEmbeddings.pretrained('sent_small_bert_L8_512', 'en')\
              .setInputCols(["document"])\
              .setOutputCol("sentence_embeddings")

classsifierdl = ClassifierDLApproach()\
              .setInputCols(["sentence_embeddings"])\
              .setOutputCol("class")\
              .setLabelColumn("category")\
              .setMaxEpochs(3)\
              .setBatchSize(64)\
              .setEnableOutputLogs(True)\
              .setLr(0.001)

bert_cmlm_clf_pipeline = Pipeline(stages = [document,
                                            bert_cmlm,
                                            classsifierdl])

sent_small_bert_L8_512 download started this may take some time.
Approximate size to download 149.1 MB
[OK!]


In [None]:
! rm -r /root/annotator_logs

In [11]:
%%time
bert_cmlm_pipelineModel = bert_cmlm_clf_pipeline.fit(trainDataset)

CPU times: user 3.06 s, sys: 390 ms, total: 3.45 s
Wall time: 9min 33s


In [14]:
log_files = os.listdir("/root/annotator_logs")

with open("/root/annotator_logs/"+log_files[0], "r") as log_file :
    print(log_file.read())

Training started - epochs: 3 - learning_rate: 0.001 - batch_size: 64 - training_examples: 20000 - classes: 4
Epoch 0/3 - 1.82s - loss: 297.21375 - acc: 0.85026044 - batches: 313
Epoch 1/3 - 1.47s - loss: 284.20157 - acc: 0.87479967 - batches: 313
Epoch 2/3 - 1.44s - loss: 282.48975 - acc: 0.8808594 - batches: 313



In [15]:
from sklearn.metrics import classification_report

preds = bert_cmlm_pipelineModel.transform(testDataset)

preds_df = preds.select('category','description',"class.result").toPandas()

preds_df['result'] = preds_df['result'].apply(lambda x : x[0])

print (classification_report(preds_df['category'], preds_df['result']))

              precision    recall  f1-score   support

    Business       0.83      0.81      0.82      1900
    Sci/Tech       0.82      0.87      0.85      1900
      Sports       0.94      0.96      0.95      1900
       World       0.90      0.85      0.88      1900

    accuracy                           0.87      7600
   macro avg       0.87      0.87      0.87      7600
weighted avg       0.87      0.87      0.87      7600



Training with BERT sentence embeddings on a large dataset takes a while, so only **3 epochs** were used; the model still reached an accuracy of **87 %** on the test set.
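
The `batches` value in the training logs follows from the dataset size and the batch size. The sketch below assumes that a final, partially filled batch is counted as a batch, which is consistent with the 313 batches reported for a batch size of 64; the helper name `batches_per_epoch` is chosen here for illustration.

```python
import math

def batches_per_epoch(training_examples, batch_size):
    """Number of mini-batches needed to see every example once."""
    return math.ceil(training_examples / batch_size)

# 20000 training examples with the batch sizes used in this notebook.
for batch_size in (8, 16, 64):
    print(batch_size, batches_per_epoch(20000, batch_size))
```

This gives 2500, 1250, and 313 batches, matching the logs. A larger batch size means fewer weight updates per epoch, which is one reason each epoch runs faster with a batch size of 64.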

## **💻 Retrain the Model with Universal Sentence Encoder**

The Universal Sentence Encoder encodes text into high dimensional vectors that can be used for text classification, semantic similarity, clustering and other natural language tasks.

- Documentation : [Universal Sentence Encoder](https://nlp.johnsnowlabs.com/docs/en/transformers#universalsentenceencoder)

- Reference Academic Paper: [Universal Sentence Encoder](https://arxiv.org/abs/1803.11175)

In [None]:
document = DocumentAssembler()\
                  .setInputCol("description")\
                  .setOutputCol("document")

use = UniversalSentenceEncoder.pretrained("tfhub_use_lg", "en") \
                  .setInputCols("document") \
                  .setOutputCol("sentence_embeddings")

classsifierdl = ClassifierDLApproach()\
                  .setInputCols(["sentence_embeddings"])\
                  .setOutputCol("class")\
                  .setLabelColumn("category")\
                  .setMaxEpochs(5)\
                  .setLr(0.001)\
                  .setBatchSize(16)\
                  .setEnableOutputLogs(True)

use_clf_pipeline = Pipeline(stages = [document,
                                      use,
                                      classsifierdl])

tfhub_use_lg download started this may take some time.
Approximate size to download 753.3 MB
[OK!]


In [None]:
! rm -r /root/annotator_logs

In [None]:
%%time 
use_pipelineModel = use_clf_pipeline.fit(trainDataset)

CPU times: user 5.34 s, sys: 594 ms, total: 5.93 s
Wall time: 15min 25s


In [None]:
log_files = os.listdir("/root/annotator_logs")
log_files

['ClassifierDLApproach_138fcac69004.log']

In [None]:
with open("/root/annotator_logs/"+log_files[0], "r") as log_file :
    print(log_file.read())

Training started - epochs: 5 - learning_rate: 0.001 - batch_size: 16 - training_examples: 20000 - classes: 4
Epoch 0/5 - 7.22s - loss: 1111.947 - acc: 0.86825 - batches: 1250
Epoch 1/5 - 6.41s - loss: 1082.4929 - acc: 0.8898 - batches: 1250
Epoch 2/5 - 6.73s - loss: 1078.3021 - acc: 0.8929 - batches: 1250
Epoch 3/5 - 6.33s - loss: 1075.4481 - acc: 0.8955 - batches: 1250
Epoch 4/5 - 6.62s - loss: 1073.1448 - acc: 0.897 - batches: 1250



With just **5 epochs**, the model using the Universal Sentence Encoder reached a training accuracy of **~ 90 %**.

The loss was still decreasing at the last epoch, so there is room to improve this model with further training.
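
To see how the loss evolves, the log can be parsed instead of read by eye. The sketch below assumes the log line format shown in this notebook (other Spark NLP versions may format it differently); the regular expression and the helper `parse_epochs` are written for this example and are not part of Spark NLP.

```python
import re

# Log lines copied from the training output above.
log_text = """\
Epoch 0/5 - 7.22s - loss: 1111.947 - acc: 0.86825 - batches: 1250
Epoch 1/5 - 6.41s - loss: 1082.4929 - acc: 0.8898 - batches: 1250
Epoch 2/5 - 6.73s - loss: 1078.3021 - acc: 0.8929 - batches: 1250
Epoch 3/5 - 6.33s - loss: 1075.4481 - acc: 0.8955 - batches: 1250
Epoch 4/5 - 6.62s - loss: 1073.1448 - acc: 0.897 - batches: 1250
"""

EPOCH_LINE = re.compile(r"Epoch (\d+)/\d+ .*? loss: ([\d.]+) - acc: ([\d.]+)")

def parse_epochs(text):
    """Return a list of (epoch, loss, accuracy) tuples found in a log."""
    return [(int(e), float(loss), float(acc))
            for e, loss, acc in EPOCH_LINE.findall(text)]

epochs = parse_epochs(log_text)

# Compare each epoch with the previous one.
for (_, prev_loss, _), (epoch, loss, acc) in zip(epochs, epochs[1:]):
    print(f"epoch {epoch}: loss change {loss - prev_loss:+.2f}, acc {acc:.4f}")
```

Every printed loss change is negative, so the loss is still falling at the last epoch, although by less each time.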

In [None]:
from sklearn.metrics import classification_report

preds = use_pipelineModel.transform(testDataset)

preds_df = preds.select('category','description',"class.result").toPandas()

preds_df['result'] = preds_df['result'].apply(lambda x : x[0])

print (classification_report(preds_df['category'], preds_df['result']))

              precision    recall  f1-score   support

    Business       0.84      0.82      0.83      1900
    Sci/Tech       0.84      0.89      0.86      1900
      Sports       0.95      0.98      0.96      1900
       World       0.92      0.87      0.90      1900

    accuracy                           0.89      7600
   macro avg       0.89      0.89      0.89      7600
weighted avg       0.89      0.89      0.89      7600



## **📊 Getting Predictions from the Trained Model**


### **🔦 Light Pipeline**

[LightPipeline](https://nlp.johnsnowlabs.com/docs/en/concepts#using-spark-nlps-lightpipeline) is a Spark NLP-specific Pipeline class equivalent to the Spark ML Pipeline.

The difference is that it does not run as a distributed Spark job; instead, it computes everything locally (but in parallel) in order to achieve **fast results** when dealing with **small amounts of data**.

In [None]:
light_model = LightPipeline(use_pipelineModel)

In [None]:
text='''
Fearing the fate of Italy, the centre-right government has threatened to be merciless with those who flout tough restrictions. 
As of Wednesday it will also include all shops being closed across Greece, with the exception of supermarkets. Banks, pharmacies, pet-stores, mobile phone stores, opticians, bakers, mini-markets, couriers and food delivery outlets are among the few that will also be allowed to remain open.
'''
result = light_model.annotate(text)

result['class']

['Business']

In [None]:
light_model.annotate('the soccer games will be postponed.')['class']

['Sports']

### **♻️ Transform**

In Spark NLP, a **pipeline** chains all the processing steps as stages and returns their combined results.

**fit()** trains the pipeline on the training instances, while **transform()** applies the fitted pipeline to new data, such as the test instances, using the parameters learned during fitting.

In [None]:
text_list=["""Fearing the fate of Italy, the centre-right government has threatened to be merciless with those who flout tough restrictions.""", 
"""As of Wednesday it will also include all shops being closed across Greece, with the exception of supermarkets. Banks, pharmacies, pet-stores, mobile phone stores, opticians, bakers, mini-markets, couriers and food delivery outlets are among the few that will also be allowed to remain open.""",
"""the soccer games will be postponed."""]

In [None]:
df = spark.createDataFrame(pd.DataFrame({'description':text_list}))

In [None]:
result = use_pipelineModel.transform(df)

In [None]:
result.select(F.explode(F.arrays_zip('document.result', 'class.result')).alias("cols")) \
.select(F.expr("cols['0']").alias("document"),
        F.expr("cols['1']").alias("class")).show(truncate=150)

+------------------------------------------------------------------------------------------------------------------------------------------------------+--------+
|                                                                                                                                              document|   class|
+------------------------------------------------------------------------------------------------------------------------------------------------------+--------+
|                        Fearing the fate of Italy, the centre-right government has threatened to be merciless with those who flout tough restrictions.|   World|
|As of Wednesday it will also include all shops being closed across Greece, with the exception of supermarkets. Banks, pharmacies, pet-stores, mobil...|Business|
|                                                                                                                   the soccer games will be postponed.|  Sports|
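+------------------------------------------------------------------------------------------------------------------------------------------------------+--------+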

## **💾 Saving & Loading Back the Trained Model**

The trained model can be saved to disk and loaded back later to get predictions.

<br/>

In [None]:
use_pipelineModel.stages

[DocumentAssembler_69248d8ba6cc,
 UNIVERSAL_SENTENCE_ENCODER_5e0d8b922c74,
 ClassifierDLModel_0142b50a8788]

In [None]:
# Save only the trained classifier, which is the last pipeline stage (index 2)
use_pipelineModel.stages[2].write().overwrite().save('useClassifierDL')

In [None]:
# Load back the saved Model
Classifier_Model = ClassifierDLModel.load('useClassifierDL')

In [None]:
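# Build a prediction pipeline around the loaded model.
# All stages are already trained, so fitting on a one-row dummy DataFrame only assembles the PipelineModel.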
ld_pipeline = Pipeline(stages=[document, use, Classifier_Model])
ld_pipeline_model = ld_pipeline.fit(spark.createDataFrame([['']]).toDF("description"))

In [None]:
# Apply Model Transform to testData
ld_preds = ld_pipeline_model.transform(testDataset) 

In [None]:
ld_preds.select('description','category',"class.result").show(10, truncate = 130)

+----------------------------------------------------------------------------------------------------------------------------------+--------+----------+
|                                                                                                                       description|category|    result|
+----------------------------------------------------------------------------------------------------------------------------------+--------+----------+
|   Unions representing workers at Turner   Newall say they are 'disappointed' after talks with stricken parent firm Federal Mogul.|Business|[Business]|
| TORONTO, Canada    A second team of rocketeers competing for the  #36;10 million Ansari X Prize, a contest for privately funde...|Sci/Tech|[Sci/Tech]|
| A company founded by a chemistry researcher at the University of Louisville won a grant to develop a method of producing bette...|Sci/Tech|[Sci/Tech]|
| It's barely dawn when Mike Fitzpatrick starts his shift with a blur of colorful 