![JohnSnowLabs](https://sparknlp.org/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/training/english/classification/ClassifierDL_Train_and_Evaluate.ipynb)

# Multi-class Text Classification using ClassifierDL

Run the next cell only if you are working in Google Colab; otherwise, skip it.

In [None]:
# Only run this cell when you are using Spark NLP on Google Colab
!wget http://setup.johnsnowlabs.com/colab.sh -O - | bash

In this notebook we check the training logs on the fly, so we start the session with `real_time_output=True`.

In [None]:
import sparknlp

spark = sparknlp.start(real_time_output=True)

print("Spark NLP version: ", sparknlp.version())
print("Apache Spark version: ", spark.version)

Spark NLP version:  4.3.1
Apache Spark version:  3.3.0
23/02/20 17:36:52 WARN Utils: Your hostname, duc-manjaro resolves to a loopback address: 127.0.1.1; using 192.168.0.34 instead (on interface enp3s0)
23/02/20 17:36:52 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address


:: loading settings :: url = jar:file:/home/root/.conda/envs/sparknlp/lib/python3.8/site-packages/pyspark/jars/ivy-2.5.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
23/02/20 17:36:54 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
23/02/20 17:36:55 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
23/02/20 17:36:55 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.


Let's download the news category dataset to train our text classifier.

In [None]:
!wget -O news_category_train.csv https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/classifier-dl/news_Category/news_category_train.csv

--2023-02-20 17:37:03--  https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/classifier-dl/news_Category/news_category_train.csv
Loaded CA certificate '/etc/ssl/certs/ca-certificates.crt'
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.217.142.224, 52.216.50.8, 52.216.162.69, ...
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.217.142.224|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 24032125 (23M) [text/csv]
Saving to: ‘news_category_train.csv’


2023-02-20 17:37:05 (13,1 MB/s) - ‘news_category_train.csv’ saved [24032125/24032125]



In [None]:
!wget -O news_category_test.csv https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/classifier-dl/news_Category/news_category_test.csv

--2023-02-20 17:37:06--  https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/classifier-dl/news_Category/news_category_test.csv
Loaded CA certificate '/etc/ssl/certs/ca-certificates.crt'
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.21.181, 54.231.130.208, 52.217.205.32, ...
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.21.181|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1504408 (1,4M) [text/csv]
Saving to: ‘news_category_test.csv’


2023-02-20 17:37:07 (2,09 MB/s) - ‘news_category_test.csv’ saved [1504408/1504408]



In [None]:
!head news_category_train.csv

category,description
Business," Short sellers, Wall Street's dwindling band of ultra cynics, are seeing green again."
Business," Private investment firm Carlyle Group, which has a reputation for making well timed and occasionally controversial plays in the defense industry, has quietly placed its bets on another part of the market."
Business, Soaring crude prices plus worries about the economy and the outlook for earnings are expected to hang over the stock market next week during the depth of the summer doldrums.
Business," Authorities have halted oil export flows from the main pipeline in southern Iraq after intelligence showed a rebel militia could strike infrastructure, an oil official said on Saturday."
Business," Tearaway world oil prices, toppling records and straining wallets, present a new economic menace barely three months before the US presidential elections."
Business," Stocks ended slightly higher on Friday but stayed near lows for the year as oil prices surged past  #36;

The text is in the `description` column and the labels are in the `category` column.
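In the `head` output above, descriptions that contain commas are wrapped in double quotes, which is standard CSV quoting. The following is a minimal sketch, using Python's standard `csv` module rather than the Spark reader the notebook relies on, of how such rows split into the two columns. The inline `sample` string is illustrative and holds only the header and two rows copied from the output above.

```python
import csv
import io
from collections import Counter

# Header plus two rows copied from the `head` output above (illustrative sample only).
sample = '''category,description
Business," Short sellers, Wall Street's dwindling band of ultra cynics, are seeing green again."
Business, Soaring crude prices plus worries about the economy and the outlook for earnings are expected to hang over the stock market next week during the depth of the summer doldrums.
'''

# DictReader takes field names from the header row and honors double-quoted fields,
# so commas inside a quoted description do not split the column.
rows = list(csv.DictReader(io.StringIO(sample)))

for row in rows:
    print(row["category"], "|", row["description"].strip()[:40])

# Tally how many examples each label has in this tiny sample.
label_counts = Counter(row["category"] for row in rows)
print(label_counts)
```

The same tally on the full dataset would show whether the classes are balanced, which matters when reading per-label precision and recall later on.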

In [None]:
trainDataset = spark.read \
      .option("header", True) \
      .csv("news_category_train.csv")

In [None]:
trainDataset.show()

+--------+--------------------+
|category|         description|
+--------+--------------------+
|Business| Short sellers, W...|
|Business| Private investme...|
|Business| Soaring crude pr...|
|Business| Authorities have...|
|Business| Tearaway world o...|
|Business| Stocks ended sli...|
|Business| Assets of the na...|
|Business| Retail sales bou...|
|Business|" After earning a...|
|Business| Short sellers, W...|
|Business| Soaring crude pr...|
|Business| OPEC can do noth...|
|Business| Non OPEC oil exp...|
|Business| WASHINGTON/NEW Y...|
|Business| The dollar tumbl...|
|Business|If you think you ...|
|Business|The purchasing po...|
|Business|There is little c...|
|Business|The US trade defi...|
|Business|Oil giant Shell c...|
+--------+--------------------+
only showing top 20 rows



In [None]:
trainDataset.count()

120000

In [None]:
from pyspark.ml import Pipeline

from sparknlp.annotator import *
from sparknlp.common import *
from sparknlp.base import *

# Prepare TestDataset for Evaluation 

Let's evaluate our ClassifierDL model during training on a test dataset that the model has never seen; later we will save the model and load it into a new pipeline. To do this, we first need to prepare the test dataset as a Parquet file, as shown below:

In [None]:
news_test_dataset = spark.read \
      .option("header", True) \
      .csv("news_category_test.csv")

In [None]:
document = DocumentAssembler()\
    .setInputCol("description")\
    .setOutputCol("document")

use = UniversalSentenceEncoder.pretrained() \
 .setInputCols(["document"])\
 .setOutputCol("sentence_embeddings")

pipeline = Pipeline(stages = [document,use])

test_dataset = pipeline.fit(news_test_dataset).transform(news_test_dataset)

tfhub_use download started this may take some time.
Approximate size to download 923.7 MB
[ | ]tfhub_use download started this may take some time.
Approximate size to download 923.7 MB
[ / ]Download done! Loading the resource.
[OK!]


In [None]:
test_dataset.show(2)

+--------+--------------------+--------------------+--------------------+
|category|         description|            document| sentence_embeddings|
+--------+--------------------+--------------------+--------------------+
|Business|Unions representi...|[{document, 0, 12...|[{sentence_embedd...|
|Sci/Tech| TORONTO, Canada ...|[{document, 0, 22...|[{sentence_embedd...|
+--------+--------------------+--------------------+--------------------+
only showing top 2 rows



Now that our test dataset has the required embeddings, we save it as Parquet and use it while training our ClassifierDL model.

In [None]:
test_dataset.write.parquet("./test_news.parquet")

Now let's train the classifier, using both a validation split and the test dataset above for evaluation.

In [None]:
classifierdl = ClassifierDLApproach()\
  .setInputCols(["sentence_embeddings"])\
  .setOutputCol("class")\
  .setLabelColumn("category")\
  .setMaxEpochs(5)\
  .setEnableOutputLogs(True)\
  .setEvaluationLogExtended(True)\
  .setValidationSplit(0.2)\
  .setTestDataset("./test_news.parquet")

pipeline = Pipeline(
    stages = [
        document,
        use,
        classifierdl
    ])
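As a rough sketch of what `setValidationSplit(0.2)` implies for this dataset, the arithmetic below derives the expected training and validation sizes from the 120,000 rows counted earlier. It assumes a simple proportional split (the actual split is made by the library and may not be exactly proportional) and a batch size of 64, which is the value reported in the training log rather than something set in this notebook.

```python
import math

total_rows = 120_000      # trainDataset.count() from earlier in the notebook
validation_split = 0.2    # value passed to setValidationSplit
batch_size = 64           # assumption: batch size reported in the training log

# Rows held out for validation, and rows left for actual training.
validation_rows = round(total_rows * validation_split)
training_rows = total_rows - validation_rows

# Number of batches needed to pass over the training rows once.
batches_per_epoch = math.ceil(training_rows / batch_size)

print(validation_rows, training_rows, batches_per_epoch)  # 24000 96000 1500
```

These figures can be compared against the `training_examples`, `validation examples`, and `batches` values printed when training starts.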

In [None]:
pipelineModel = pipeline.fit(trainDataset)

Training started - epochs: 5 - learning_rate: 0.005 - batch_size: 64 - training_examples: 96000 - classes: 4
Epoch 1/5 - 6.34s - loss: 1305.07 - acc: 0.8793854 - batches: 1500
Quality on validation dataset (20.0%), validation examples = 24000
time to finish evaluation: 0.44s
label      tp	 fp	 fn	 prec	 rec	 f1
Sci/Tech   4907	 753	 1013	 0.8669611	 0.82888514	 0.8474956
Business   5201	 1162	 806	 0.81738174	 0.8658232	 0.8409054
Sports     5891	 350	 118	 0.94391924	 0.9803628	 0.96179587
World      5272	 464	 792	 0.9191074	 0.8693931	 0.8935593
tp: 21271 fp: 2729 fn: 2729 labels: 4
Macro-average	 prec: 0.88684237, rec: 0.8861161, f1: 0.886479
Micro-average	 prec: 0.8862917, recall: 0.8862917, f1: 0.8862917
Quality on test dataset: 
time to finish evaluation: 0.07s
label      tp	 fp	 fn	 prec	 rec	 f1
Sci/Tech   1579	 247	 321	 0.86473167	 0.83105266	 0.84755766
Business   1631	 390	 269	 0.8070262	 0.858421	 0.83193064
Sports     1858	 100	 42	 0.94892746	 0.9778947	 0.96319336
Wor
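To make the extended evaluation log easier to read, here is a small sketch that recomputes precision, recall, and F1 from the `tp`, `fp`, and `fn` counts of the validation block for epoch 1. The helper name `prf` is hypothetical. Judging from the logged numbers, the macro-averaged F1 appears to be the harmonic mean of macro precision and macro recall rather than the mean of the per-label F1 scores; treat that as an observation from this log, not a documented guarantee.

```python
# (tp, fp, fn) per label, copied from the validation block of the epoch 1 log above.
counts = {
    "Sci/Tech": (4907, 753, 1013),
    "Business": (5201, 1162, 806),
    "Sports": (5891, 350, 118),
    "World": (5272, 464, 792),
}

def prf(tp, fp, fn):
    """Return (precision, recall, f1) for one set of counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

per_label = {label: prf(*c) for label, c in counts.items()}

# Macro average: average the per-label precision and recall, then combine them.
macro_prec = sum(p for p, _, _ in per_label.values()) / len(per_label)
macro_rec = sum(r for _, r, _ in per_label.values()) / len(per_label)
macro_f1 = 2 * macro_prec * macro_rec / (macro_prec + macro_rec)

# Micro average: pool the counts across labels first, then compute the metrics once.
total_tp = sum(tp for tp, _, _ in counts.values())
total_fp = sum(fp for _, fp, _ in counts.values())
total_fn = sum(fn for _, _, fn in counts.values())
micro_prec, micro_rec, micro_f1 = prf(total_tp, total_fp, total_fn)

for label, (p, r, f) in per_label.items():
    print(f"{label:<9} prec={p:.4f} rec={r:.4f} f1={f:.4f}")
print(f"macro     prec={macro_prec:.4f} rec={macro_rec:.4f} f1={macro_f1:.4f}")
print(f"micro     prec={micro_prec:.4f} rec={micro_rec:.4f} f1={micro_f1:.4f}")
```

In single-label classification every false positive for one label is a false negative for another, so the pooled `fp` and `fn` totals match and micro precision, recall, and F1 coincide, as the log shows.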

# How to use an already trained ClassifierDL pipeline or its model

We have two ways of using what we already trained: pipeline or model.

Let's see how to save the entire pipeline, load it, and make predictions with the loaded pipeline.

## Save and load pre-trained ClassifierDL pipeline

In [None]:
# The free tier of Google Colab has limited memory, so saving and loading the
# full pipeline is not feasible in this notebook. It works locally or on a machine with more memory.

# from pyspark.ml import PipelineModel
# pipelineModel.save("./classifierdl_pipeline")
# loadedPipeline = PipelineModel.load("./classifierdl_pipeline")
# loadedPipeline.transform(YOUR_DATAFRAME)

## Save and load pre-trained ClassifierDL model

In [None]:
# dbfs:/ or hdfs:/ if you are saving it on distributed file systems
pipelineModel.stages[-1].write().overwrite().save('./tmp_classifierDL_model')

Let's use our pre-trained ClassifierDLModel in a pipeline: 

In [None]:

# In a new pipeline you can load it for prediction
document = DocumentAssembler()\
    .setInputCol("description")\
    .setOutputCol("document")

use = UniversalSentenceEncoder.pretrained() \
 .setInputCols(["document"])\
 .setOutputCol("sentence_embeddings")

classifierdl = ClassifierDLModel.load("./tmp_classifierDL_model") \
  .setInputCols(["sentence_embeddings"])\
  .setOutputCol("class")

pipeline = Pipeline(
    stages = [
        document,
        use,
        classifierdl
    ])


tfhub_use download started this may take some time.
Approximate size to download 923.7 MB
[OK!]


Now let's run this pipeline on a couple of sample texts to get predictions from the loaded model together with the other stages:

In [None]:
from pyspark.sql.types import StringType

dfTest = spark.createDataFrame([
    "Unions representing workers at Turner   Newall say they are 'disappointed' after talks with stricken parent firm Federal Mogul.",
    "Scientists have discovered irregular lumps beneath the icy surface of Jupiter's largest moon, Ganymede. These irregular masses may be rock formations, supported by Ganymede's icy shell for billions of years..."
], StringType()).toDF("description")

In [None]:
prediction = pipeline.fit(dfTest).transform(dfTest)

In [None]:
prediction.select("class.result").show()

prediction.select("class.metadata").show(truncate=False)

+----------+
|    result|
+----------+
|[Business]|
|[Sci/Tech]|
+----------+

+------------------------------------------------------------------------------------------------------------------+
|metadata                                                                                                          |
+------------------------------------------------------------------------------------------------------------------+
|[{Sports -> 1.09076216E-4, Business -> 0.9996996, World -> 1.05234445E-4, Sci/Tech -> 8.617702E-5, sentence -> 0}]|
|[{Sports -> 8.139581E-19, Business -> 2.8369764E-17, World -> 3.814643E-17, Sci/Tech -> 1.0, sentence -> 0}]      |
+------------------------------------------------------------------------------------------------------------------+
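The `metadata` column holds one score per label, plus a `sentence` index. As a minimal sketch of how to turn such a map into a single label and confidence, the hypothetical helper `top_label` below picks the highest-scoring entry. The dictionary is copied from the first row of the output above; it assumes the scores arrive as strings, which is how Spark NLP annotation metadata is typically represented, so they are converted with `float` first.

```python
# Metadata for the first prediction, copied from the output above.
# Assumption: values are strings, so they need converting before comparison.
metadata = {
    "Sports": "1.09076216E-4",
    "Business": "0.9996996",
    "World": "1.05234445E-4",
    "Sci/Tech": "8.617702E-5",
    "sentence": "0",
}

def top_label(meta, skip_keys=("sentence",)):
    """Return (label, score) for the highest-scoring label, ignoring non-label keys."""
    scores = {k: float(v) for k, v in meta.items() if k not in skip_keys}
    best = max(scores, key=scores.get)
    return best, scores[best]

label, confidence = top_label(metadata)
print(label, confidence)  # Business 0.9996996
```

The selected label agrees with the `class.result` value for the same row, and the score can serve as a threshold for flagging low-confidence predictions.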

