![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

# Spark NLP
### Multi-class Text Classification
#### By using ClassifierDL

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/training/english/classification/ClassifierDL_Train_multi_class_news_category_classifier.ipynb)

Run this block only if you are inside Google Colab; otherwise, skip it.

In [None]:
import os

# Install java
! apt-get install -y openjdk-8-jdk-headless -qq > /dev/null
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["PATH"] = os.environ["JAVA_HOME"] + "/bin:" + os.environ["PATH"]
! java -version

# Install pyspark
! pip install --ignore-installed -q pyspark==2.4.4

# Install Spark NLP
! pip install --ignore-installed -q spark-nlp==2.4.4

`UniversalSentenceEncoder` requires a larger Kryo serializer buffer (`spark.kryoserializer.buffer.max`), so we create the SparkSession manually:

In [1]:
import sparknlp
from pyspark.sql import SparkSession

def start():
    builder = SparkSession.builder \
        .appName("Spark NLP") \
        .master("local[*]") \
        .config("spark.driver.memory", "16G") \
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")\
        .config("spark.kryoserializer.buffer.max", "1000M")\
        .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.11:2.4.4")

    return builder.getOrCreate()

  
spark = start()

print("Spark NLP version")
sparknlp.version()
print("Apache Spark version")
spark.version

Spark NLP version
Apache Spark version


'2.4.4'
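
The session settings above can also be kept in a plain dictionary, which makes them easy to inspect or tweak before the session is built. This is only a sketch: the names `SPARK_CONF`, `apply_conf`, and `start_from_conf` are hypothetical helpers, and the values are copied from the cell above.

```python
# Settings copied from the cell above; keeping them in a dict is an
# illustrative choice, not something Spark NLP requires.
SPARK_CONF = {
    "spark.driver.memory": "16G",
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    "spark.kryoserializer.buffer.max": "1000M",
    "spark.jars.packages": "com.johnsnowlabs.nlp:spark-nlp_2.11:2.4.4",
}


def apply_conf(builder, conf):
    """Call builder.config(key, value) for every entry and return the builder."""
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder


def start_from_conf(conf=SPARK_CONF):
    """Build the session from the dict (hypothetical variant of start())."""
    # Imported lazily so the dict can be inspected without pyspark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName("Spark NLP").master("local[*]")
    return apply_conf(builder, conf).getOrCreate()
```

Calling `start_from_conf()` should behave like `start()` above, assuming the same packages are installed.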

Let's download the news category dataset to train our text classifier.

In [None]:
!wget -O news_category_train.csv https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/classifier-dl/news_Category/news_category_train.csv

In [None]:
!wget -O news_category_test.csv https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/classifier-dl/news_Category/news_category_test.csv

In [None]:
!head news_category_train.csv

The text is in the `description` column and the labels are in the `category` column.

In [2]:
trainDataset = spark.read \
      .option("header", True) \
      .csv("news_category_train.csv")

In [3]:
trainDataset.show()

+--------+--------------------+
|category|         description|
+--------+--------------------+
|Business| Short sellers, W...|
|Business| Private investme...|
|Business| Soaring crude pr...|
|Business| Authorities have...|
|Business| Tearaway world o...|
|Business| Stocks ended sli...|
|Business| Assets of the na...|
|Business| Retail sales bou...|
|Business|" After earning a...|
|Business| Short sellers, W...|
|Business| Soaring crude pr...|
|Business| OPEC can do noth...|
|Business| Non OPEC oil exp...|
|Business| WASHINGTON/NEW Y...|
|Business| The dollar tumbl...|
|Business|If you think you ...|
|Business|The purchasing po...|
|Business|There is little c...|
|Business|The US trade defi...|
|Business|Oil giant Shell c...|
+--------+--------------------+
only showing top 20 rows



In [4]:
trainDataset.count()

120000
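
Before training, it can help to check how many rows each label has. The sketch below does this with the standard library on a tiny inline sample in the same two-column layout (`category`, `description`). The sample rows and the helper name `label_counts` are made up for illustration; they are not taken from the dataset.

```python
import csv
import io
from collections import Counter

# Hypothetical three-row sample in the same two-column layout; the real
# file is news_category_train.csv.
SAMPLE = '''category,description
Business,"Short sellers, hypothetical example text."
Sports,Hypothetical match report.
Business,Hypothetical market summary.
'''


def label_counts(handle, label_column="category"):
    """Count rows per label in a CSV file object that has a header row."""
    reader = csv.DictReader(handle)
    return Counter(row[label_column] for row in reader)


counts = label_counts(io.StringIO(SAMPLE))
print(counts.most_common())
```

On the Spark DataFrame itself, `trainDataset.groupBy("category").count().show()` gives the same kind of tally.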

In [5]:
from pyspark.ml import Pipeline

from sparknlp.annotator import *
from sparknlp.common import *
from sparknlp.base import *

In [6]:
# The text to classify is in the description column
document = DocumentAssembler() \
    .setInputCol("description") \
    .setOutputCol("document")

# Encode each document into a sentence embedding
use = UniversalSentenceEncoder.pretrained() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence_embeddings")

# The classes/labels/categories are in the category column
classifierdl = ClassifierDLApproach() \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("class") \
    .setLabelColumn("category") \
    .setMaxEpochs(10) \
    .setEnableOutputLogs(True)

pipeline = Pipeline(
    stages=[
        document,
        use,
        classifierdl
    ])

tfhub_use download started this may take some time.
Approximate size to download 923.7 MB
[OK!]


In [7]:
pipelineModel = pipeline.fit(trainDataset)

In [25]:
!cd ~/annotator_logs && ls -l

total 8
-rw-r--r--  1 maziyar  staff  974 Mar 17 15:46 ClassifierDLApproach_d4a8d8ae15c4.log


In [10]:
!cat ~/annotator_logs/ClassifierDLApproach_d4a8d8ae15c4.log

Training started - total epochs: 10 - learning rate: 0.005 - batch size: 64 - training examples: 120000
Epoch 0/10 - 20.595176306%.2fs - loss: 1620.4523 - accuracy: 0.88139164 - batches: 1875
Epoch 1/10 - 20.896655075%.2fs - loss: 1595.1614 - accuracy: 0.8924 - batches: 1875
Epoch 2/10 - 18.908607226%.2fs - loss: 1581.3849 - accuracy: 0.8971583 - batches: 1875
Epoch 3/10 - 18.855000017%.2fs - loss: 1570.0774 - accuracy: 0.90113336 - batches: 1875
Epoch 4/10 - 18.834848244%.2fs - loss: 1566.8865 - accuracy: 0.90415835 - batches: 1875
Epoch 5/10 - 18.658137128%.2fs - loss: 1564.1078 - accuracy: 0.90699166 - batches: 1875
Epoch 6/10 - 18.913159529%.2fs - loss: 1561.3577 - accuracy: 0.9088 - batches: 1875
Epoch 7/10 - 18.658633957%.2fs - loss: 1558.2217 - accuracy: 0.91049165 - batches: 1875
Epoch 8/10 - 18.919921582%.2fs - loss: 1554.3966 - accuracy: 0.9116833 - batches: 1875
Epoch 9/10 - 18.389415558%.2fs - loss: 1553.3986 - accuracy: 0.91281664 - batches: 1875
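
The log format above is regular enough to parse if you want the per-epoch numbers as data, for example to plot accuracy. The following is a minimal sketch using a regular expression written for the lines shown here. `EPOCH_RE` and `parse_epochs` are hypothetical names, and the pattern is an assumption based on this one log, not a documented format.

```python
import re

# Three lines copied from the log above; the stray "%.2fs" is kept as logged.
LOG = """Training started - total epochs: 10 - learning rate: 0.005 - batch size: 64 - training examples: 120000
Epoch 0/10 - 20.595176306%.2fs - loss: 1620.4523 - accuracy: 0.88139164 - batches: 1875
Epoch 9/10 - 18.389415558%.2fs - loss: 1553.3986 - accuracy: 0.91281664 - batches: 1875
"""

EPOCH_RE = re.compile(
    r"Epoch (?P<epoch>\d+)/\d+ .* loss: (?P<loss>[\d.]+) - "
    r"accuracy: (?P<accuracy>[\d.]+) - batches: (?P<batches>\d+)"
)


def parse_epochs(text):
    """Return one dict per 'Epoch' line with numeric fields converted."""
    rows = []
    for line in text.splitlines():
        match = EPOCH_RE.search(line)
        if match:
            rows.append({
                "epoch": int(match["epoch"]),
                "loss": float(match["loss"]),
                "accuracy": float(match["accuracy"]),
                "batches": int(match["batches"]),
            })
    return rows


epochs = parse_epochs(LOG)
best = max(epochs, key=lambda row: row["accuracy"])
print(best["epoch"], best["accuracy"])
```

The `batches: 1875` field is consistent with the header line: 120000 training examples at a batch size of 64 give exactly 1875 batches per epoch.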


In [11]:
from pyspark.sql.types import StringType

dfTest = spark.createDataFrame([
    "Unions representing workers at Turner   Newall say they are 'disappointed' after talks with stricken parent firm Federal Mogul.",
    "Scientists have discovered irregular lumps beneath the icy surface of Jupiter's largest moon, Ganymede. These irregular masses may be rock formations, supported by Ganymede's icy shell for billions of years..."
], StringType()).toDF("description")

In [12]:
prediction = pipelineModel.transform(dfTest)

In [13]:
prediction.select("class.result").show()

prediction.select("class.metadata").show(truncate=False)

+----------+
|    result|
+----------+
|[Business]|
|[Sci/Tech]|
+----------+

+-----------------------------------------------------------------------------------------------------------------+
|metadata                                                                                                         |
+-----------------------------------------------------------------------------------------------------------------+
|[[Sports -> 1.874596E-7, Business -> 0.99999976, World -> 3.8649663E-8, Sci/Tech -> 3.9560025E-8, sentence -> 0]]|
|[[Sports -> 8.1529744E-19, Business -> 1.2242059E-17, World -> 1.8783108E-19, Sci/Tech -> 1.0, sentence -> 0]]   |
+-----------------------------------------------------------------------------------------------------------------+
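
Each metadata map holds one score per class plus a `sentence` index, and the predicted label in `result` corresponds to the highest score. Here is a small sketch of reading such a map in plain Python, using the scores from the first row above. The helper `top_label` is hypothetical, and treating `sentence` as the only non-class key is an assumption based on this output.

```python
# Scores copied from the first metadata row above. Spark NLP stores metadata
# values as strings, so they are written as strings here as well.
metadata = {
    "Sports": "1.874596E-7",
    "Business": "0.99999976",
    "World": "3.8649663E-8",
    "Sci/Tech": "3.9560025E-8",
    "sentence": "0",
}


def top_label(meta, non_label_keys=("sentence",)):
    """Return (label, score) for the highest-scoring class in a metadata map."""
    scores = {k: float(v) for k, v in meta.items() if k not in non_label_keys}
    label = max(scores, key=scores.get)
    return label, scores[label]


print(top_label(metadata))
```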



In [14]:
testDataset = spark.read \
      .option("header", True) \
      .csv("news_category_test.csv")

In [15]:
preds = pipelineModel.transform(testDataset)

In [16]:
preds.select('category','description',"class.result").show(50, truncate=50)

+--------+--------------------------------------------------+----------+
|category|                                       description|    result|
+--------+--------------------------------------------------+----------+
|Business|Unions representing workers at Turner   Newall ...|[Business]|
|Sci/Tech| TORONTO, Canada    A second team of rocketeers...|[Sci/Tech]|
|Sci/Tech| A company founded by a chemistry researcher at...|[Sci/Tech]|
|Sci/Tech| It's barely dawn when Mike Fitzpatrick starts ...|[Sci/Tech]|
|Sci/Tech| Southern California's smog fighting agency wen...|[Sci/Tech]|
|Sci/Tech|"The British Department for Education and Skill...|   [World]|
|Sci/Tech|"confessed author of the Netsky and Sasser viru...|[Sci/Tech]|
|Sci/Tech|\\FOAF/LOAF  and bloom filters have a lot of in...|[Sci/Tech]|
|Sci/Tech|"Wiltshire Police warns about ""phishing"" afte...|[Sci/Tech]|
|Sci/Tech|In its first two years, the UK's dedicated card...|[Sci/Tech]|
|Sci/Tech| A group of technology companies  includi

In [17]:
preds_df = preds.select('category','description',"class.result").toPandas()

In [18]:
# `result` is an array because Spark NLP can return one prediction per sentence:
# adding a SentenceDetector before UniversalSentenceEncoder would yield a
# prediction for each sentence. Here each document is a single unit, so we
# keep the first (and only) item of the array in the result column.
preds_df['result'] = preds_df['result'].apply(lambda x : x[0])
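
Indexing with `x[0]` raises an `IndexError` if a row ever comes back with an empty array. A slightly more defensive variant is sketched below; `first_or_none` is a hypothetical helper, and whether empty arrays can actually occur in this pipeline is not something the notebook establishes.

```python
def first_or_none(labels):
    """Return the first label of a result array, or None if it is empty or missing."""
    if labels is None or len(labels) == 0:
        return None
    return labels[0]


print(first_or_none(["Business"]))
print(first_or_none([]))
```

It could be used in place of the lambda, as in `preds_df['result'].apply(first_or_none)`, applied to the array column before it is flattened.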

In [19]:
# We are going to use sklearn to evaluate the results on the test dataset
from sklearn.metrics import classification_report

In [20]:
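# sklearn's signature is classification_report(y_true, y_pred). This call passes
# the predictions first, so "support" counts predicted labels and the precision
# and recall columns are exchanged relative to the usual reading.
print(classification_report(preds_df['result'], preds_df['category']))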

              precision    recall  f1-score   support

    Business       0.85      0.85      0.85      1910
    Sci/Tech       0.88      0.85      0.87      1955
      Sports       0.98      0.95      0.97      1948
       World       0.87      0.93      0.90      1787

    accuracy                           0.90      7600
   macro avg       0.90      0.90      0.90      7600
weighted avg       0.90      0.90      0.90      7600
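
scikit-learn documents the signature as `classification_report(y_true, y_pred)`, and the argument order matters when reading the table: swapping the two exchanges precision and recall for every class. The sketch below shows this in plain Python. The helper `precision_recall` is hypothetical and the toy labels are invented for illustration, not taken from the dataset.

```python
def precision_recall(y_true, y_pred, label):
    """Precision and recall for one label, computed from paired label lists."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    predicted = sum(p == label for p in y_pred)
    actual = sum(t == label for t in y_true)
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    return precision, recall


# Toy labels invented for illustration; they are not taken from the dataset.
gold = ["Business", "Business", "World", "World", "World"]
pred = ["Business", "World", "World", "World", "World"]

print(precision_recall(gold, pred, "World"))  # gold labels first
print(precision_recall(pred, gold, "World"))  # arguments swapped
```

With the gold labels first, `World` has precision 0.75 and recall 1.0; with the arguments swapped, the two values trade places.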

