![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

# Spark NLP
### Multi-class Text Classification
#### By using ClassifierDL

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/training/english/classification/ClassifierDL_Train_multi_class_news_category_classifier.ipynb)

Only run this block if you are inside Google Colab; otherwise, skip it.

In [1]:
# This is only needed to set up PySpark and Spark NLP on Colab
!wget http://setup.johnsnowlabs.com/colab.sh -O - | bash

openjdk version "1.8.0_252"
OpenJDK Runtime Environment (build 1.8.0_252-8u252-b09-1~18.04-b09)
OpenJDK 64-Bit Server VM (build 25.252-b09, mixed mode)


In [4]:
import sparknlp

spark = sparknlp.start()

print("Spark NLP version: ", sparknlp.version())
print("Apache Spark version: ", spark.version)

Spark NLP version:  2.5.1
Apache Spark version:  2.4.4


Let's download the news category dataset for training our text classifier:

In [3]:
!wget -O news_category_train.csv https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/classifier-dl/news_Category/news_category_train.csv

--2020-06-01 14:03:51--  https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/classifier-dl/news_Category/news_category_train.csv
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.108.53
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.108.53|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 24032125 (23M) [text/csv]
Saving to: ‘news_category_train.csv’


2020-06-01 14:03:52 (52.0 MB/s) - ‘news_category_train.csv’ saved [24032125/24032125]



In [4]:
!wget -O news_category_test.csv https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/classifier-dl/news_Category/news_category_test.csv

--2020-06-01 14:03:53--  https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/classifier-dl/news_Category/news_category_test.csv
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.171.149
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.171.149|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1504408 (1.4M) [text/csv]
Saving to: ‘news_category_test.csv’


2020-06-01 14:03:53 (18.2 MB/s) - ‘news_category_test.csv’ saved [1504408/1504408]



In [5]:
!head news_category_train.csv

category,description
Business," Short sellers, Wall Street's dwindling band of ultra cynics, are seeing green again."
Business," Private investment firm Carlyle Group, which has a reputation for making well timed and occasionally controversial plays in the defense industry, has quietly placed its bets on another part of the market."
Business, Soaring crude prices plus worries about the economy and the outlook for earnings are expected to hang over the stock market next week during the depth of the summer doldrums.
Business," Authorities have halted oil export flows from the main pipeline in southern Iraq after intelligence showed a rebel militia could strike infrastructure, an oil official said on Saturday."
Business," Tearaway world oil prices, toppling records and straining wallets, present a new economic menace barely three months before the US presidential elections."
Business," Stocks ended slightly higher on Friday but stayed near lows for the year as oil prices surged past  #36;

The text is in the `description` column and the labels are in the `category` column.
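Some descriptions in the sample above are wrapped in double quotes because they contain commas. A minimal sketch of why that matters, using Python's standard `csv` module instead of Spark purely for illustration (the notebook itself reads the file with `spark.read.csv`); the two data rows are copied from the `head` output:

```python
import csv
import io

# Header plus two rows copied from the `head` output above.
sample = '''category,description
Business," Short sellers, Wall Street's dwindling band of ultra cynics, are seeing green again."
Business, Soaring crude prices plus worries about the economy and the outlook for earnings are expected to hang over the stock market next week during the depth of the summer doldrums.
'''

# A CSV-aware reader keeps a quoted field together even if it contains commas.
rows = list(csv.DictReader(io.StringIO(sample)))

for row in rows:
    # Each row maps the header names to the parsed field values.
    print(row["category"], "|", row["description"].strip()[:40])
```

Splitting each line naively on `,` would break the first description into several pieces, which is why a proper CSV parser is needed for this file.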

In [None]:
trainDataset = spark.read \
      .option("header", True) \
      .csv("news_category_train.csv")

In [7]:
trainDataset.show()

+--------+--------------------+
|category|         description|
+--------+--------------------+
|Business| Short sellers, W...|
|Business| Private investme...|
|Business| Soaring crude pr...|
|Business| Authorities have...|
|Business| Tearaway world o...|
|Business| Stocks ended sli...|
|Business| Assets of the na...|
|Business| Retail sales bou...|
|Business|" After earning a...|
|Business| Short sellers, W...|
|Business| Soaring crude pr...|
|Business| OPEC can do noth...|
|Business| Non OPEC oil exp...|
|Business| WASHINGTON/NEW Y...|
|Business| The dollar tumbl...|
|Business|If you think you ...|
|Business|The purchasing po...|
|Business|There is little c...|
|Business|The US trade defi...|
|Business|Oil giant Shell c...|
+--------+--------------------+
only showing top 20 rows



In [8]:
trainDataset.count()

120000

In [None]:
from pyspark.ml import Pipeline

from sparknlp.annotator import *
from sparknlp.common import *
from sparknlp.base import *

In [10]:
# The actual text is in the description column
document = DocumentAssembler()\
    .setInputCol("description")\
    .setOutputCol("document")

use = UniversalSentenceEncoder.pretrained()\
    .setInputCols(["document"])\
    .setOutputCol("sentence_embeddings")

# The classes/labels/categories are in the category column
classifierdl = ClassifierDLApproach()\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("class")\
    .setLabelColumn("category")\
    .setMaxEpochs(5)\
    .setEnableOutputLogs(True)

pipeline = Pipeline(
    stages = [
        document,
        use,
        classifierdl
    ])

tfhub_use download started this may take some time.
Approximate size to download 923.7 MB
[OK!]


In [None]:
pipelineModel = pipeline.fit(trainDataset)

In [17]:
!cd ~/annotator_logs && ls -l

total 4
-rw-r--r-- 1 root root 529 Jun  1 14:11 ClassifierDLApproach_d82e68030034.log


In [18]:
!cat ~/annotator_logs/ClassifierDLApproach_d82e68030034.log

Training started - total epochs: 5 - learning rate: 0.005 - batch size: 64 - training examples: 120000
Epoch 0/5 - 35.876454147%.2fs - loss: 1588.0098 - accuracy: 0.8812917 - batches: 1875
Epoch 1/5 - 34.984400619%.2fs - loss: 1569.1891 - accuracy: 0.892 - batches: 1875
Epoch 2/5 - 34.980620721%.2fs - loss: 1560.8793 - accuracy: 0.8966333 - batches: 1875
Epoch 3/5 - 34.97171791%.2fs - loss: 1556.4751 - accuracy: 0.9005917 - batches: 1875
Epoch 4/5 - 35.060583703%.2fs - loss: 1550.6415 - accuracy: 0.90370834 - batches: 1875
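The log above can be read programmatically, for example to track accuracy per epoch. The following is a small sketch using only the standard library; the regular expression is an assumption based on the exact lines shown here and may need adjusting for other Spark NLP versions. The `%.2fs` fragment appears to be an unformatted duration placeholder in the log line, so the pattern simply skips that field. The sketch also checks that 1875 batches matches 120000 examples at batch size 64.

```python
import math
import re

# Log lines copied from the output above.
log_text = """Training started - total epochs: 5 - learning rate: 0.005 - batch size: 64 - training examples: 120000
Epoch 0/5 - 35.876454147%.2fs - loss: 1588.0098 - accuracy: 0.8812917 - batches: 1875
Epoch 1/5 - 34.984400619%.2fs - loss: 1569.1891 - accuracy: 0.892 - batches: 1875
Epoch 2/5 - 34.980620721%.2fs - loss: 1560.8793 - accuracy: 0.8966333 - batches: 1875
Epoch 3/5 - 34.97171791%.2fs - loss: 1556.4751 - accuracy: 0.9005917 - batches: 1875
Epoch 4/5 - 35.060583703%.2fs - loss: 1550.6415 - accuracy: 0.90370834 - batches: 1875
"""

# Capture epoch index, total epochs, loss, accuracy and batch count;
# the duration field between the first two " - " separators is skipped.
epoch_re = re.compile(
    r"Epoch (\d+)/(\d+) - .*? - loss: ([\d.]+) - accuracy: ([\d.]+) - batches: (\d+)"
)

epochs = [
    {
        "epoch": int(m.group(1)),
        "loss": float(m.group(3)),
        "accuracy": float(m.group(4)),
        "batches": int(m.group(5)),
    }
    for m in epoch_re.finditer(log_text)
]

# Batches per epoch = ceil(training examples / batch size).
expected_batches = math.ceil(120000 / 64)

for e in epochs:
    print(e["epoch"], e["loss"], e["accuracy"])
print("expected batches per epoch:", expected_batches)
```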


# How to use an already trained ClassifierDL pipeline or its model

There are two ways to reuse what we have trained: the whole pipeline or only the model.

Let's see how we can save the entire pipeline, load it, and run predictions with that pre-trained pipeline.

## Save and load pre-trained ClassifierDL pipeline

In [None]:
# The free tier of Google Colab comes with limited memory, so it's not
# possible to save and load the pipeline in this notebook.
# You can do this locally or on a machine with more memory:

# from pyspark.ml import PipelineModel
# pipelineModel.save("./classifierdl_pipeline")
# loadedPipeline = PipelineModel.load("./classifierdl_pipeline")
# loadedPipeline.transform(YOUR_DATAFRAME)

## Save and load pre-trained ClassifierDL model

In [None]:
# Use a dbfs:/ or hdfs:/ path if you are saving to a distributed file system
pipelineModel.stages[-1].write().overwrite().save('./tmp_classifierDL_model')



Let's use our pre-trained ClassifierDLModel in a pipeline: 

In [22]:

# In a new pipeline you can load the saved model for prediction
document = DocumentAssembler()\
    .setInputCol("description")\
    .setOutputCol("document")

use = UniversalSentenceEncoder.pretrained()\
    .setInputCols(["document"])\
    .setOutputCol("sentence_embeddings")

classifierdl = ClassifierDLModel.load("./tmp_classifierDL_model")\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("class")

pipeline = Pipeline(
    stages = [
        document,
        use,
        classifierdl
    ])


tfhub_use download started this may take some time.
Approximate size to download 923.7 MB
[OK!]


Now let's run a couple of sample texts through the whole pipeline to get predictions:

In [None]:
from pyspark.sql.types import StringType

dfTest = spark.createDataFrame([
    "Unions representing workers at Turner   Newall say they are 'disappointed' after talks with stricken parent firm Federal Mogul.",
    "Scientists have discovered irregular lumps beneath the icy surface of Jupiter's largest moon, Ganymede. These irregular masses may be rock formations, supported by Ganymede's icy shell for billions of years..."
], StringType()).toDF("description")

In [None]:
# Every stage here is already trained, so fit() only assembles a PipelineModel
prediction = pipeline.fit(dfTest).transform(dfTest)

In [27]:
prediction.select("class.result").show()

prediction.select("class.metadata").show(truncate=False)

+----------+
|    result|
+----------+
|[Business]|
|[Sci/Tech]|
+----------+

+-----------------------------------------------------------------------------------------------------------------+
|metadata                                                                                                         |
+-----------------------------------------------------------------------------------------------------------------+
|[[Sports -> 9.916687E-8, Business -> 0.99999917, World -> 4.8718215E-7, Sci/Tech -> 2.1577937E-7, sentence -> 0]]|
|[[Sports -> 9.949142E-13, Business -> 2.2465226E-11, World -> 6.4324095E-11, Sci/Tech -> 1.0, sentence -> 0]]    |
+-----------------------------------------------------------------------------------------------------------------+
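The `metadata` column holds one score per class, and the label in `class.result` is the class with the highest score. A minimal sketch of that relationship in plain Python, using the first row above; it assumes the metadata values arrive as strings (Spark NLP annotation metadata is generally a string-to-string map), and `top_label` is a hypothetical helper, not part of Spark NLP:

```python
# Metadata of the first prediction above, written as a Python dict.
# Assumption: values are strings; convert with float() before comparing.
metadata = {
    "Sports": "9.916687E-8",
    "Business": "0.99999917",
    "World": "4.8718215E-7",
    "Sci/Tech": "2.1577937E-7",
    "sentence": "0",
}

def top_label(meta):
    """Return (label, score) for the highest-scoring class."""
    # "sentence" is the sentence index, not a class score, so drop it.
    scores = {k: float(v) for k, v in meta.items() if k != "sentence"}
    label = max(scores, key=scores.get)
    return label, scores[label]

label, score = top_label(metadata)
print(label, score)
```

The same scores can be used to flag low-confidence predictions, for example by comparing `score` against a threshold of your choice.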



# Evaluation 

Let's evaluate the ClassifierDL model we trained earlier on a test dataset that the model has never seen:

In [None]:
testDataset = spark.read \
      .option("header", True) \
      .csv("news_category_test.csv")

In [None]:
preds = pipelineModel.transform(testDataset)

In [30]:
preds.select('category','description',"class.result").show(50, truncate=50)

+--------+--------------------------------------------------+----------+
|category|                                       description|    result|
+--------+--------------------------------------------------+----------+
|Business|Unions representing workers at Turner   Newall ...|[Business]|
|Sci/Tech| TORONTO, Canada    A second team of rocketeers...|[Sci/Tech]|
|Sci/Tech| A company founded by a chemistry researcher at...|[Sci/Tech]|
|Sci/Tech| It's barely dawn when Mike Fitzpatrick starts ...|[Sci/Tech]|
|Sci/Tech| Southern California's smog fighting agency wen...|[Sci/Tech]|
|Sci/Tech|"The British Department for Education and Skill...|[Sci/Tech]|
|Sci/Tech|"confessed author of the Netsky and Sasser viru...|[Sci/Tech]|
|Sci/Tech|\\FOAF/LOAF  and bloom filters have a lot of in...|[Sci/Tech]|
|Sci/Tech|"Wiltshire Police warns about ""phishing"" afte...|[Sci/Tech]|
|Sci/Tech|In its first two years, the UK's dedicated card...|[Sci/Tech]|
|Sci/Tech| A group of technology companies  includi

In [None]:
preds_df = preds.select('category','description',"class.result").toPandas()

In [None]:
# The result is an array because in Spark NLP a document can have multiple sentences.
# This means you can add a SentenceDetector to the pipeline, feed its output into
# UniversalSentenceEncoder, and get one prediction per sentence.
# Here each row has a single prediction, so let's take the first item of each array
preds_df['result'] = preds_df['result'].apply(lambda x : x[0])
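Note that `x[0]` raises an `IndexError` if an array is empty. As a hedged sketch, assuming some rows could come back with no prediction (for instance an empty description), a small helper can return `None` instead; `first_label` is a hypothetical name used only for illustration:

```python
def first_label(result):
    """Return the first predicted label, or None for an empty array."""
    return result[0] if len(result) > 0 else None

# Plain lists stand in for the arrays in the `result` column.
results = [["Business"], ["Sci/Tech"], []]
labels = [first_label(r) for r in results]
print(labels)
```

With pandas this could be applied as `preds_df['result'].apply(first_label)`, after which rows with `None` can be inspected or dropped before scoring.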

In [None]:
# We are going to use sklearn to evaluate the results on the test dataset
from sklearn.metrics import classification_report

Let's use `classification_report` from `sklearn` to evaluate the final scores. (Keep in mind that, due to limited resources on the free Google Colab, we trained for only 5 epochs.)

In [34]:
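# Note: sklearn expects (y_true, y_pred). Predictions are passed first here,
# so precision and recall are swapped in the report printed by this call.
print(classification_report(preds_df['result'], preds_df['category']))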

              precision    recall  f1-score   support

    Business       0.82      0.87      0.84      1801
    Sci/Tech       0.90      0.84      0.87      2036
      Sports       0.98      0.95      0.97      1957
       World       0.87      0.92      0.90      1806

    accuracy                           0.89      7600
   macro avg       0.89      0.89      0.89      7600
weighted avg       0.90      0.89      0.89      7600
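`classification_report` is documented as taking the true labels first and the predictions second. The call above passes `preds_df['result']` (predictions) first, so the precision and recall columns are effectively interchanged and `support` counts predicted rather than true labels. The sketch below shows the effect with a hand-rolled `precision_recall` function on toy labels; it is an illustration written for this note, not sklearn's implementation, and the toy data is made up:

```python
def precision_recall(y_true, y_pred, label):
    """Compute precision and recall for one label from two label lists."""
    # True positives: rows where both the truth and the prediction are `label`.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    predicted = sum(1 for p in y_pred if p == label)  # predicted as `label`
    actual = sum(1 for t in y_true if t == label)     # truly `label`
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    return precision, recall

# Toy data (hypothetical): three true Business rows, only one predicted correctly.
y_true = ["Business", "Business", "Business", "World"]
y_pred = ["Business", "World", "World", "World"]

correct_order = precision_recall(y_true, y_pred, "Business")
swapped_order = precision_recall(y_pred, y_true, "Business")

print("y_true first:", correct_order)
print("y_pred first:", swapped_order)
```

Swapping the arguments leaves the F1 score and overall accuracy unchanged, since both are symmetric in precision and recall, but the per-class precision and recall values trade places.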

