![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)



[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/5.1_Text_classification_examples_in_SparkML_SparkNLP.ipynb)

# Text Classification with Spark NLP


<b>If you want to work with Spark 2.3:</b>
```
! pip install --upgrade pyspark==2.4.4

! pip install --ignore-installed -q spark-nlp==2.7.5

import sparknlp

spark = sparknlp.start(spark23=True)
```

In [None]:

import sys

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline

from sparknlp.annotator import *
from sparknlp.common import *
from sparknlp.base import *

import pandas as pd


In [None]:
import sparknlp
spark = sparknlp.start()

print("Spark NLP version: ", sparknlp.version())
print("Apache Spark version: ", spark.version)

Spark NLP version:  3.1.0
Apache Spark version:  3.0.2


In [None]:
! wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Public/data/news_category_train.csv
! wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/tutorials/Certification_Trainings/Public/data/news_category_test.csv

In [None]:
# If the data were stored as Parquet instead: newsDF = spark.read.parquet("data/news_category.parquet")

newsDF = spark.read \
      .option("header", True) \
      .csv("news_category_train.csv")

newsDF.show(truncate=50)

+--------+--------------------------------------------------+
|category|                                       description|
+--------+--------------------------------------------------+
|Business| Short sellers, Wall Street's dwindling band of...|
|Business| Private investment firm Carlyle Group, which h...|
|Business| Soaring crude prices plus worries about the ec...|
|Business| Authorities have halted oil export flows from ...|
|Business| Tearaway world oil prices, toppling records an...|
|Business| Stocks ended slightly higher on Friday but sta...|
|Business| Assets of the nation's retail money market mut...|
|Business| Retail sales bounced back a bit in July, and n...|
|Business|" After earning a PH.D. in Sociology, Danny Baz...|
|Business| Short sellers, Wall Street's dwindling  band o...|
|Business| Soaring crude prices plus worries  about the e...|
|Business| OPEC can do nothing to douse scorching  oil pr...|
|Business| Non OPEC oil exporters should consider  increa...|
|Busines

In [None]:
newsDF.take(2)

[Row(category='Business', description=" Short sellers, Wall Street's dwindling band of ultra cynics, are seeing green again."),
 Row(category='Business', description=' Private investment firm Carlyle Group, which has a reputation for making well timed and occasionally controversial plays in the defense industry, has quietly placed its bets on another part of the market.')]

In [None]:
from pyspark.sql.functions import col

newsDF.groupBy("category") \
    .count() \
    .orderBy(col("count").desc()) \
    .show()

+--------+-----+
|category|count|
+--------+-----+
|Sci/Tech|30000|
|   World|30000|
|  Sports|30000|
|Business|30000|
+--------+-----+



## Building Classification Pipeline

### LogReg with CountVectorizer

tokenizer: splits each description into tokens

stopwords_cleaner: removes stop words

countVectors: builds count vectors (“document-term vectors”) from the remaining tokens
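
To make the three steps above concrete, here is a minimal pure-Python sketch of tokenization, stop-word removal, and count vectors. It does not use Spark at all: the tiny stop-word set, the two sample sentences, and the helper names (`tokenize`, `remove_stopwords`, `build_vocab`, `count_vector`) are illustrative assumptions, not part of Spark NLP or Spark ML.

```python
import re
from collections import Counter

# Hypothetical, tiny stop-word list for illustration only.
STOP_WORDS = {"the", "of", "are", "a", "and", "on"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stopwords(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]

def build_vocab(docs):
    """Map each distinct term to a column index, most frequent term first."""
    freq = Counter(t for doc in docs for t in doc)
    ordered = sorted(freq, key=lambda t: (-freq[t], t))
    return {term: i for i, term in enumerate(ordered)}

def count_vector(tokens, vocab):
    """Return a sparse {column index: count} document-term vector."""
    counts = Counter(vocab[t] for t in tokens if t in vocab)
    return dict(sorted(counts.items()))

texts = ["Oil prices are soaring", "The price of oil and the oil market"]
docs = [remove_stopwords(tokenize(t)) for t in texts]
vocab = build_vocab(docs)
vectors = [count_vector(d, vocab) for d in docs]

print(vocab)
print(vectors)
```

Each document becomes a sparse mapping from vocabulary index to term count, which is the same kind of information the `SparseVector` values in the `features` column carry.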

In [None]:
from pyspark.ml.feature import CountVectorizer, HashingTF, IDF, OneHotEncoder, StringIndexer, VectorAssembler, SQLTransformer


In [None]:
%%time

document_assembler = DocumentAssembler() \
      .setInputCol("description") \
      .setOutputCol("document")
    
tokenizer = Tokenizer() \
      .setInputCols(["document"]) \
      .setOutputCol("token")
      
normalizer = Normalizer() \
      .setInputCols(["token"]) \
      .setOutputCol("normalized")

stopwords_cleaner = StopWordsCleaner()\
      .setInputCols(["normalized"])\
      .setOutputCol("cleanTokens")\
      .setCaseSensitive(False)

stemmer = Stemmer() \
      .setInputCols(["cleanTokens"]) \
      .setOutputCol("stem")

finisher = Finisher() \
      .setInputCols(["stem"]) \
      .setOutputCols(["token_features"]) \
      .setOutputAsArray(True) \
      .setCleanAnnotations(False)

countVectors = CountVectorizer(inputCol="token_features", outputCol="features", vocabSize=10000, minDF=5)

label_stringIdx = StringIndexer(inputCol = "category", outputCol = "label")

nlp_pipeline = Pipeline(
    stages=[document_assembler, 
            tokenizer,
            normalizer,
            stopwords_cleaner, 
            stemmer, 
            finisher,
            countVectors,
            label_stringIdx])

nlp_model = nlp_pipeline.fit(newsDF)

processed = nlp_model.transform(newsDF)

processed.count()

CPU times: user 596 ms, sys: 85.5 ms, total: 682 ms
Wall time: 1min 35s


In [None]:
processed.select('description','token_features').show(truncate=50)

+--------------------------------------------------+--------------------------------------------------+
|                                       description|                                    token_features|
+--------------------------------------------------+--------------------------------------------------+
| Short sellers, Wall Street's dwindling band of...|[short, seller, wall, street, dwindl, band, ult...|
| Private investment firm Carlyle Group, which h...|[privat, invest, firm, carlyl, group, reput, ma...|
| Soaring crude prices plus worries about the ec...|[soar, crude, price, plu, worri, economi, outlo...|
| Authorities have halted oil export flows from ...|[author, halt, oil, export, flow, main, pipelin...|
| Tearaway world oil prices, toppling records an...|[tearawai, world, oil, price, toppl, record, st...|
| Stocks ended slightly higher on Friday but sta...|[stock, end, slightli, higher, fridai, staye, n...|
| Assets of the nation's retail money market mut...|[asset, nati

In [None]:
processed.select('token_features').take(2)

[Row(token_features=['short', 'seller', 'wall', 'street', 'dwindl', 'band', 'ultra', 'cynic', 'see', 'green']),
 Row(token_features=['privat', 'invest', 'firm', 'carlyl', 'group', 'reput', 'make', 'well', 'time', 'occasion', 'controversi', 'plai', 'defens', 'industri', 'quietli', 'place', 'bet', 'anoth', 'part', 'market'])]

In [None]:
processed.select('features').take(2)

[Row(features=SparseVector(10000, {241: 1.0, 384: 1.0, 467: 1.0, 744: 1.0, 838: 1.0, 2228: 1.0, 3675: 1.0, 6139: 1.0, 6239: 1.0})),
 Row(features=SparseVector(10000, {26: 1.0, 38: 1.0, 46: 1.0, 68: 1.0, 117: 1.0, 155: 1.0, 182: 1.0, 197: 1.0, 246: 1.0, 303: 1.0, 320: 1.0, 407: 1.0, 427: 1.0, 621: 1.0, 867: 1.0, 2359: 1.0, 2824: 1.0, 2867: 1.0, 6814: 1.0}))]

In [None]:
processed.select('description','features','label').show()

+--------------------+--------------------+-----+
|         description|            features|label|
+--------------------+--------------------+-----+
| Short sellers, W...|(10000,[241,384,4...|  1.0|
| Private investme...|(10000,[26,38,46,...|  1.0|
| Soaring crude pr...|(10000,[15,28,46,...|  1.0|
| Authorities have...|(10000,[0,32,35,4...|  1.0|
| Tearaway world o...|(10000,[1,2,11,28...|  1.0|
| Stocks ended sli...|(10000,[3,13,14,2...|  1.0|
| Assets of the na...|(10000,[0,4,10,15...|  1.0|
| Retail sales bou...|(10000,[0,1,10,15...|  1.0|
|" After earning a...|(10000,[98,99,125...|  1.0|
| Short sellers, W...|(10000,[241,384,4...|  1.0|
| Soaring crude pr...|(10000,[15,28,46,...|  1.0|
| OPEC can do noth...|(10000,[0,24,28,2...|  1.0|
| Non OPEC oil exp...|(10000,[0,21,28,3...|  1.0|
| WASHINGTON/NEW Y...|(10000,[2,4,13,14...|  1.0|
| The dollar tumbl...|(10000,[2,14,72,1...|  1.0|
|If you think you ...|(10000,[74,76,143...|  1.0|
|The purchasing po...|(10000,[46,54,167...|  1.0|


In [None]:
# set seed for reproducibility
(trainingData, testData) = processed.randomSplit([0.7, 0.3], seed = 100)
print("Training Dataset Count: " + str(trainingData.count()))
print("Test Dataset Count: " + str(testData.count()))

Training Dataset Count: 84038
Test Dataset Count: 35962
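
The split sizes above are close to, but not exactly, 70/30 (84,038 of 120,000 rows rather than 84,000). That is consistent with a split that draws a seeded random number for every row, which is roughly what `randomSplit` does. The sketch below imitates that idea in plain Python; `random_split` is a hypothetical helper, not Spark's implementation, and it will not reproduce Spark's exact row assignment.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row to a split independently, using normalized weights."""
    rng = random.Random(seed)
    total = sum(weights)
    # Cumulative boundaries, e.g. [0.7, 1.0] for weights [0.7, 0.3].
    bounds = []
    acc = 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)
    splits = [[] for _ in weights]
    for row in rows:
        u = rng.random()
        for i, b in enumerate(bounds):
            # The last split also catches any floating-point rounding gap.
            if u < b or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits

rows = list(range(1000))
train_a, test_a = random_split(rows, [0.7, 0.3], seed=100)
train_b, test_b = random_split(rows, [0.7, 0.3], seed=100)

print(len(train_a), len(test_a))
print(train_a == train_b)  # same seed, same split
```

Because the seed fixes the sequence of random draws, rerunning the split on the same data gives the same partition, which is the point of passing `seed = 100` in the cell above.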


In [None]:
trainingData.printSchema()

root
 |-- category: string (nullable = true)
 |-- description: string (nullable = true)
 |-- document: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)
 |-- token: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |   

In [None]:
from pyspark.ml.classification import LogisticRegression

lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0)

lrModel = lr.fit(trainingData)

predictions = lrModel.transform(testData)

predictions.filter(predictions['prediction'] == 0) \
    .select("description","category","probability","label","prediction") \
    .orderBy("probability", ascending=False) \
    .show(n = 10, truncate = 30)


+------------------------------+--------+------------------------------+-----+----------+
|                   description|category|                   probability|label|prediction|
+------------------------------+--------+------------------------------+-----+----------+
|Novell brings updated kerne...|Business|[0.9999995988391355,2.43405...|  1.0|       0.0|
| Cray taps Linux for more a...|Sci/Tech|[0.999989295731228,5.279009...|  0.0|       0.0|
|F5 bolsters firewall family...|Sci/Tech|[0.9999765208304277,2.07099...|  0.0|       0.0|
| You Software Inc. announce...|Sci/Tech|[0.9999616282388853,1.09273...|  0.0|       0.0|
|Awarding the iMac G5 five s...|Sci/Tech|[0.9997536740123691,1.61162...|  0.0|       0.0|
|\\I've blogged before  abou...|Sci/Tech|[0.9995861395030622,3.26631...|  0.0|       0.0|
| At a special music event o...|Sci/Tech|[0.9995683585016614,3.19690...|  0.0|       0.0|
| Not long ago most corporat...|Sci/Tech|[0.9995509842232488,2.62018...|  0.0|       0.0|
|IBM Corp.

In [None]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# metricName defaults to "f1" (F1 averaged over classes, weighted by support)
evaluator = MulticlassClassificationEvaluator(predictionCol="prediction")

evaluator.evaluate(predictions)

0.8989802998146215

In [None]:
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
y_true = predictions.select("label")
y_true = y_true.toPandas()

y_pred = predictions.select("prediction")
y_pred = y_pred.toPandas()

In [None]:
y_pred.prediction.value_counts()

3.0    9384
0.0    9044
1.0    8868
2.0    8666
Name: prediction, dtype: int64

In [None]:
cnf_matrix = confusion_matrix(list(y_true.label.astype(int)), list(y_pred.prediction.astype(int)))
cnf_matrix

array([[7792,  732,  312,  114],
       [ 881, 7658,  284,   85],
       [ 308,  426, 7993,  291],
       [  63,   52,   77, 8894]])

In [None]:
print(classification_report(y_true.label, y_pred.prediction))
print(accuracy_score(y_true.label, y_pred.prediction))

              precision    recall  f1-score   support

         0.0       0.86      0.87      0.87      8950
         1.0       0.86      0.86      0.86      8908
         2.0       0.92      0.89      0.90      9018
         3.0       0.95      0.98      0.96      9086

    accuracy                           0.90     35962
   macro avg       0.90      0.90      0.90     35962
weighted avg       0.90      0.90      0.90     35962

0.8991991546632556
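
The numbers in the report can be derived directly from the confusion matrix printed earlier. The sketch below copies that matrix by hand (rows are true labels, columns are predicted labels) and recomputes precision, recall, F1, accuracy, and the support-weighted F1 in plain Python; the variable names are illustrative.

```python
# Confusion matrix copied from the output above:
# rows = true label, columns = predicted label.
cnf = [
    [7792, 732, 312, 114],
    [881, 7658, 284, 85],
    [308, 426, 7993, 291],
    [63, 52, 77, 8894],
]

n_classes = len(cnf)
# Row sums: number of true examples per class (the "support" column).
support = [sum(row) for row in cnf]
# Column sums: number of times each class was predicted.
predicted = [sum(row[j] for row in cnf) for j in range(n_classes)]
total = sum(support)

precision = [cnf[i][i] / predicted[i] for i in range(n_classes)]
recall = [cnf[i][i] / support[i] for i in range(n_classes)]
f1 = [2 * p * r / (p + r) for p, r in zip(precision, recall)]

# Accuracy is the diagonal (correct predictions) over all examples.
accuracy = sum(cnf[i][i] for i in range(n_classes)) / total
# Support-weighted F1, the averaging used in the "weighted avg" row.
weighted_f1 = sum(f * s for f, s in zip(f1, support)) / total

for i in range(n_classes):
    print(f"class {i}: precision={precision[i]:.2f} recall={recall[i]:.2f} f1={f1[i]:.2f}")
print(f"accuracy={accuracy:.4f} weighted_f1={weighted_f1:.4f}")
```

The column sums reproduce the `value_counts()` output, and the weighted F1 should agree with the score returned by `MulticlassClassificationEvaluator`, whose default metric is F1.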


### LogReg with TFIDF

In [None]:
from pyspark.ml.feature import HashingTF, IDF

hashingTF = HashingTF(inputCol="token_features", outputCol="rawFeatures", numFeatures=10000)

idf = IDF(inputCol="rawFeatures", outputCol="features", minDocFreq=5)  # minDocFreq: terms in fewer than 5 documents get an IDF of 0

nlp_pipeline_tf = Pipeline(
    stages=[document_assembler, 
            tokenizer,
            normalizer,
            stopwords_cleaner, 
            stemmer, 
            finisher,
            hashingTF,
            idf,
            label_stringIdx])

nlp_model_tf = nlp_pipeline_tf.fit(newsDF)

processed_tf = nlp_model_tf.transform(newsDF)

processed_tf.count()


120000
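
As a rough illustration of what `HashingTF` followed by `IDF` computes, the sketch below hashes tokens into a fixed number of buckets, counts them, and rescales the counts with the smoothed IDF formula from the Spark ML documentation, log((m + 1) / (df + 1)), where m is the number of documents. It is plain Python under stated assumptions: CRC32 stands in for Spark's MurmurHash3, so bucket indices will differ from Spark's, and `bucket`, `hashing_tf`, `fit_idf`, and `tf_idf` are hypothetical helper names.

```python
import math
import zlib

NUM_FEATURES = 16  # tiny for readability; the pipeline above uses 10000

def bucket(term):
    """Stand-in hash: CRC32 modulo the bucket count (Spark uses MurmurHash3)."""
    return zlib.crc32(term.encode("utf-8")) % NUM_FEATURES

def hashing_tf(tokens):
    """Sparse term-frequency vector: {bucket index: count}."""
    tf = {}
    for t in tokens:
        b = bucket(t)
        tf[b] = tf.get(b, 0) + 1
    return tf

def fit_idf(tf_vectors, min_doc_freq=0):
    """Smoothed IDF per bucket: log((m + 1) / (df + 1)); 0 below min_doc_freq."""
    m = len(tf_vectors)
    df = {}
    for tf in tf_vectors:
        for b in tf:
            df[b] = df.get(b, 0) + 1
    return {b: (math.log((m + 1) / (d + 1)) if d >= min_doc_freq else 0.0)
            for b, d in df.items()}

def tf_idf(tf, idf):
    """Scale each term count by the IDF weight of its bucket."""
    return {b: c * idf.get(b, 0.0) for b, c in tf.items()}

docs = [["oil", "price", "oil"], ["oil", "market"], ["stock", "market"]]
tfs = [hashing_tf(d) for d in docs]
idf = fit_idf(tfs)
features = [tf_idf(tf, idf) for tf in tfs]

print(features)
```

Unlike `CountVectorizer`, hashing needs no vocabulary pass, at the cost of possible collisions when two terms land in the same bucket.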

In [None]:
# preview the TF-IDF features and labels
processed_tf.select('description','features','label').show()

+--------------------+--------------------+-----+
|         description|            features|label|
+--------------------+--------------------+-----+
| Short sellers, W...|(10000,[25,625,66...|  1.0|
| Private investme...|(10000,[82,111,15...|  1.0|
| Soaring crude pr...|(10000,[410,1097,...|  1.0|
| Authorities have...|(10000,[1611,1637...|  1.0|
| Tearaway world o...|(10000,[1150,1427...|  1.0|
| Stocks ended sli...|(10000,[332,410,6...|  1.0|
| Assets of the na...|(10000,[1442,1788...|  1.0|
| Retail sales bou...|(10000,[25,117,97...|  1.0|
|" After earning a...|(10000,[114,643,7...|  1.0|
| Short sellers, W...|(10000,[25,625,66...|  1.0|
| Soaring crude pr...|(10000,[410,1097,...|  1.0|
| OPEC can do noth...|(10000,[616,904,1...|  1.0|
| Non OPEC oil exp...|(10000,[616,2224,...|  1.0|
| WASHINGTON/NEW Y...|(10000,[351,360,3...|  1.0|
| The dollar tumbl...|(10000,[359,456,9...|  1.0|
|If you think you ...|(10000,[1041,1564...|  1.0|
|The purchasing po...|(10000,[2198,4091...|  1.0|


In [None]:
# set seed for reproducibility
(trainingData, testData) = processed_tf.randomSplit([0.7, 0.3], seed = 100)
print("Training Dataset Count: " + str(trainingData.count()))
print("Test Dataset Count: " + str(testData.count()))

Training Dataset Count: 84038
Test Dataset Count: 35962


In [None]:
lrModel_tf = lr.fit(trainingData)

predictions_tf = lrModel_tf.transform(testData)

predictions_tf.select("description","category","probability","label","prediction") \
    .orderBy("probability", ascending=False) \
    .show(n = 10, truncate = 30)


+------------------------------+--------+------------------------------+-----+----------+
|                   description|category|                   probability|label|prediction|
+------------------------------+--------+------------------------------+-----+----------+
|Novell brings updated kerne...|Business|[0.9999995229578786,2.86181...|  1.0|       0.0|
|F5 bolsters firewall family...|Sci/Tech|[0.9999631964683853,2.87934...|  0.0|       0.0|
|\\Sam blogs about his wiki ...|Sci/Tech|[0.9999591803496525,3.00816...|  0.0|       0.0|
| Cray taps Linux for more a...|Sci/Tech|[0.99988175049369,1.0958016...|  0.0|       0.0|
| At a special music event o...|Sci/Tech|[0.9997199938921201,2.03591...|  0.0|       0.0|
| You Software Inc. announce...|Sci/Tech|[0.9996235826906964,5.33405...|  0.0|       0.0|
|MOFFETT FIELD, CALIFORNIA -...|Sci/Tech|[0.9995490457561748,2.09063...|  0.0|       0.0|
|Sun Microsystems will integ...|Sci/Tech|[0.9990914778205654,2.87415...|  0.0|       0.0|
| Microsof

In [None]:
y_true = predictions_tf.select("label")
y_true = y_true.toPandas()

y_pred = predictions_tf.select("prediction")
y_pred = y_pred.toPandas()

print(classification_report(y_true.label, y_pred.prediction))
print(accuracy_score(y_true.label, y_pred.prediction))

              precision    recall  f1-score   support

         0.0       0.85      0.85      0.85      8950
         1.0       0.85      0.85      0.85      8908
         2.0       0.91      0.88      0.90      9018
         3.0       0.94      0.96      0.95      9086

    accuracy                           0.89     35962
   macro avg       0.89      0.89      0.89     35962
weighted avg       0.89      0.89      0.89     35962

0.8877148100773038


### Random Forest with TFIDF

In [None]:
from pyspark.ml.classification import RandomForestClassifier

rf = RandomForestClassifier(labelCol="label",
                            featuresCol="features",
                            numTrees = 100,
                            maxDepth = 4,
                            maxBins = 32)

# Train model with Training Data
rfModel = rf.fit(trainingData)
predictions_rf = rfModel.transform(testData)


In [None]:
predictions_rf.select("description","category","probability","label","prediction") \
    .orderBy("probability", ascending=False) \
    .show(n = 10, truncate = 30)

+------------------------------+--------+------------------------------+-----+----------+
|                   description|category|                   probability|label|prediction|
+------------------------------+--------+------------------------------+-----+----------+
|Google Inc., the world #39;...|Business|[0.3817798792801465,0.24244...|  1.0|       0.0|
| Microsoft (Nasdaq: MSFT) l...|Sci/Tech|[0.3812741133118802,0.22181...|  0.0|       0.0|
|NOVEMBER 15, 2004 (IDG NEWS...|Sci/Tech|[0.3802678197628161,0.26350...|  0.0|       0.0|
|In response to the growing ...|Sci/Tech|[0.37445493519549894,0.2517...|  0.0|       0.0|
|Web services are becoming i...|Sci/Tech|[0.36954934339567,0.2339840...|  0.0|       0.0|
| Microsoft Corp. introduced...|Sci/Tech|[0.36868311701599515,0.2482...|  0.0|       0.0|
|The same week Microsoft rel...|Sci/Tech|[0.36621667041915645,0.2232...|  0.0|       0.0|
|Microsoft plans to launch a...|Business|[0.36604196140734435,0.2225...|  1.0|       0.0|
|A critica

In [None]:
y_true = predictions_rf.select("label")
y_true = y_true.toPandas()

y_pred = predictions_rf.select("prediction")
y_pred = y_pred.toPandas()

print(classification_report(y_true.label, y_pred.prediction))
print(accuracy_score(y_true.label, y_pred.prediction))

              precision    recall  f1-score   support

         0.0       0.76      0.64      0.70      8950
         1.0       0.74      0.70      0.72      8908
         2.0       0.78      0.76      0.77      9018
         3.0       0.72      0.89      0.79      9086

    accuracy                           0.75     35962
   macro avg       0.75      0.75      0.75     35962
weighted avg       0.75      0.75      0.75     35962

0.7478171403147768


## LogReg with Spark NLP GloVe Word Embeddings

In [None]:
document_assembler = DocumentAssembler() \
      .setInputCol("description") \
      .setOutputCol("document")
    
tokenizer = Tokenizer() \
      .setInputCols(["document"]) \
      .setOutputCol("token")
    
normalizer = Normalizer() \
      .setInputCols(["token"]) \
      .setOutputCol("normalized")

stopwords_cleaner = StopWordsCleaner()\
      .setInputCols(["normalized"])\
      .setOutputCol("cleanTokens")\
      .setCaseSensitive(False)

# pretrained() with no arguments loads the default model (glove_100d, see the download log below)
glove_embeddings = WordEmbeddingsModel.pretrained() \
      .setInputCols(["document",'cleanTokens'])\
      .setOutputCol("embeddings")\
      .setCaseSensitive(False)

embeddingsSentence = SentenceEmbeddings() \
      .setInputCols(["document", "embeddings"]) \
      .setOutputCol("sentence_embeddings") \
      .setPoolingStrategy("AVERAGE")
    
embeddings_finisher = EmbeddingsFinisher() \
      .setInputCols(["sentence_embeddings"]) \
      .setOutputCols(["finished_sentence_embeddings"]) \
      .setOutputAsVector(True)\
      .setCleanAnnotations(False)

explodeVectors = SQLTransformer(statement=
      "SELECT EXPLODE(finished_sentence_embeddings) AS features, * FROM __THIS__")

label_stringIdx = StringIndexer(inputCol = "category", outputCol = "label")


nlp_pipeline_w2v = Pipeline(
    stages=[document_assembler, 
            tokenizer,
            normalizer,
            stopwords_cleaner, 
            glove_embeddings,
            embeddingsSentence,
            embeddings_finisher,
            explodeVectors,
            label_stringIdx])

nlp_model_w2v = nlp_pipeline_w2v.fit(newsDF)

processed_w2v = nlp_model_w2v.transform(newsDF)

processed_w2v.count()


glove_100d download started this may take some time.
Approximate size to download 145.3 MB
[OK!]


120000
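
The `AVERAGE` pooling strategy set on `SentenceEmbeddings` turns a variable number of word vectors into one fixed-size vector per document. A minimal pure-Python sketch of that idea follows; the 3-dimensional toy vectors and the `average_pool` helper are hypothetical, and returning a zero vector for an empty token list is an assumption of this sketch rather than documented Spark NLP behavior.

```python
def average_pool(token_vectors, dim):
    """Element-wise mean of token vectors; a zero vector if there are none."""
    if not token_vectors:
        return [0.0] * dim
    n = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / n for i in range(dim)]

# Hypothetical 3-dimensional word vectors (glove_100d vectors have 100 dimensions).
toy_embeddings = {
    "oil": [1.0, 0.0, 2.0],
    "price": [3.0, 2.0, 0.0],
}

tokens = ["oil", "price"]
sentence_vector = average_pool([toy_embeddings[t] for t in tokens], dim=3)
empty_vector = average_pool([], dim=3)

print(sentence_vector)
print(empty_vector)
```

A document whose tokens contribute no usable vectors would end up with an uninformative all-zero embedding under this assumption, which is one plausible reason the notebook later filters test rows with `num_nonzeros`.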

In [None]:
processed_w2v.columns

['features',
 'category',
 'description',
 'document',
 'token',
 'normalized',
 'cleanTokens',
 'embeddings',
 'sentence_embeddings',
 'finished_sentence_embeddings',
 'label']

In [None]:
processed_w2v.show(5)

+--------------------+--------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+----------------------------+-----+
|            features|category|         description|            document|               token|          normalized|         cleanTokens|          embeddings| sentence_embeddings|finished_sentence_embeddings|label|
+--------------------+--------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+----------------------------+-----+
|[-0.1556767076253...|Business| Short sellers, W...|[[document, 0, 84...|[[token, 1, 5, Sh...|[[token, 1, 5, Sh...|[[token, 1, 5, Sh...|[[word_embeddings...|[[sentence_embedd...|        [[-0.155676707625...|  1.0|
|[-0.0144653050228...|Business| Private investme...|[[document, 0, 20...|[[token, 1, 7, Pr...|[[token, 1, 7, Pr...|[[token, 1, 7, Pr...|[[word_e

In [None]:
processed_w2v.select('finished_sentence_embeddings').take(1)

[Row(finished_sentence_embeddings=[DenseVector([-0.1557, 0.196, 0.1099, -0.3089, 0.16, 0.1672, -0.4649, -0.1101, -0.053, -0.1551, 0.0327, 0.0772, 0.1494, -0.1865, 0.1155, -0.0597, 0.0234, -0.0451, 0.2361, -0.0089, 0.3358, 0.0444, 0.0088, -0.1453, 0.2289, 0.0914, -0.1665, -0.3726, 0.1892, 0.121, 0.1993, -0.0239, -0.1346, 0.1159, 0.2086, 0.1285, 0.068, 0.1372, 0.3153, -0.1934, 0.0257, -0.226, -0.0984, 0.1139, 0.1413, -0.3743, 0.072, 0.1403, 0.251, -0.3106, 0.1709, -0.0697, -0.0554, 0.5123, -0.1873, -1.7784, 0.0295, 0.1014, 0.9268, 0.2129, -0.1354, 0.5739, -0.0679, 0.461, 0.4216, 0.0225, 0.4456, -0.2462, 0.1411, -0.3258, 0.0025, 0.0114, -0.3895, -0.1106, -0.261, 0.0147, 0.0781, 0.1268, -0.2042, -0.2278, 0.5096, 0.1539, -0.3515, -0.0102, -0.7003, -0.3872, -0.1668, -0.2405, -0.0766, 0.1396, -0.0592, -0.1568, -0.1606, -0.1371, -0.684, -0.2549, -0.1541, 0.1536, 0.2715, 0.3342])])]

In [None]:
# If SQLTransformer is not used inside the pipeline, the embeddings can be exploded outside it instead:
from pyspark.sql.functions import explode

# processed_w2v= processed_w2v.withColumn("features", explode(processed_w2v.finished_sentence_embeddings))

In [None]:
processed_w2v.select("features").take(1)

[Row(features=DenseVector([-0.1557, 0.196, 0.1099, -0.3089, 0.16, 0.1672, -0.4649, -0.1101, -0.053, -0.1551, 0.0327, 0.0772, 0.1494, -0.1865, 0.1155, -0.0597, 0.0234, -0.0451, 0.2361, -0.0089, 0.3358, 0.0444, 0.0088, -0.1453, 0.2289, 0.0914, -0.1665, -0.3726, 0.1892, 0.121, 0.1993, -0.0239, -0.1346, 0.1159, 0.2086, 0.1285, 0.068, 0.1372, 0.3153, -0.1934, 0.0257, -0.226, -0.0984, 0.1139, 0.1413, -0.3743, 0.072, 0.1403, 0.251, -0.3106, 0.1709, -0.0697, -0.0554, 0.5123, -0.1873, -1.7784, 0.0295, 0.1014, 0.9268, 0.2129, -0.1354, 0.5739, -0.0679, 0.461, 0.4216, 0.0225, 0.4456, -0.2462, 0.1411, -0.3258, 0.0025, 0.0114, -0.3895, -0.1106, -0.261, 0.0147, 0.0781, 0.1268, -0.2042, -0.2278, 0.5096, 0.1539, -0.3515, -0.0102, -0.7003, -0.3872, -0.1668, -0.2405, -0.0766, 0.1396, -0.0592, -0.1568, -0.1606, -0.1371, -0.684, -0.2549, -0.1541, 0.1536, 0.2715, 0.3342]))]

In [None]:
processed_w2v.select('description','features','label').show()


+--------------------+--------------------+-----+
|         description|            features|label|
+--------------------+--------------------+-----+
| Short sellers, W...|[-0.1556767076253...|  1.0|
| Private investme...|[-0.0144653050228...|  1.0|
| Soaring crude pr...|[0.10348732769489...|  1.0|
| Authorities have...|[-0.0355810523033...|  1.0|
| Tearaway world o...|[0.00647281948477...|  1.0|
| Stocks ended sli...|[0.20069395005702...|  1.0|
| Assets of the na...|[0.38012433052062...|  1.0|
| Retail sales bou...|[0.20352847874164...|  1.0|
|" After earning a...|[0.13536226749420...|  1.0|
| Short sellers, W...|[-0.1556767076253...|  1.0|
| Soaring crude pr...|[0.10348732769489...|  1.0|
| OPEC can do noth...|[0.20307321846485...|  1.0|
| Non OPEC oil exp...|[0.09010648727416...|  1.0|
| WASHINGTON/NEW Y...|[0.10887209326028...|  1.0|
| The dollar tumbl...|[0.05723679438233...|  1.0|
|If you think you ...|[0.11463439464569...|  1.0|
|The purchasing po...|[0.05890964344143...|  1.0|


In [None]:
# set seed for reproducibility
(trainingData, testData) = processed_w2v.randomSplit([0.7, 0.3], seed = 100)
print("Training Dataset Count: " + str(trainingData.count()))
print("Test Dataset Count: " + str(testData.count()))

Training Dataset Count: 84038
Test Dataset Count: 35962


In [None]:
from pyspark.sql.functions import udf

@udf("long")
def num_nonzeros(v):
    return v.numNonzeros()

# keep only test rows whose feature vector has at least one non-zero entry
testData = testData.where(num_nonzeros("features") != 0)

In [None]:
lrModel_w2v = lr.fit(trainingData)

In [None]:
predictions_w2v = lrModel_w2v.transform(testData)

predictions_w2v.select("description","category","probability","label","prediction") \
    .orderBy("probability", ascending=False) \
    .show(n = 10, truncate = 30)


+------------------------------+--------+------------------------------+-----+----------+
|                   description|category|                   probability|label|prediction|
+------------------------------+--------+------------------------------+-----+----------+
|The KDE Project has release...|Sci/Tech|[0.9993026934526253,5.26811...|  0.0|       0.0|
| Users can now access searc...|Sci/Tech|[0.9984190388827833,0.00112...|  0.0|       0.0|
|" The Xbox version of ""Doom 3|Sci/Tech|[0.9978700979929389,8.61016...|  0.0|       0.0|
|" The Xbox version of ""Doom 3|Sci/Tech|[0.9978700979929389,8.61016...|  0.0|       0.0|
|" The Xbox version of ""Doom 3|Sci/Tech|[0.9978700979929389,8.61016...|  0.0|       0.0|
|With Google Desktop Search,...|Sci/Tech|[0.9966360633091306,0.00214...|  0.0|       0.0|
|Google has finally announce...|Business|[0.996501742592862,0.002382...|  1.0|       0.0|
|But users of the popular co...|Sci/Tech|[0.9963135491275907,0.00278...|  0.0|       0.0|
|" Users o

In [None]:
y_true = predictions_w2v.select("label")
y_true = y_true.toPandas()

y_pred = predictions_w2v.select("prediction")
y_pred = y_pred.toPandas()

print(classification_report(y_true.label, y_pred.prediction))
print(accuracy_score(y_true.label, y_pred.prediction))

              precision    recall  f1-score   support

         0.0       0.82      0.81      0.81      8902
         1.0       0.82      0.83      0.82      9072
         2.0       0.88      0.87      0.87      9006
         3.0       0.93      0.95      0.94      8982

    accuracy                           0.86     35962
   macro avg       0.86      0.86      0.86     35962
weighted avg       0.86      0.86      0.86     35962

0.8633835715477448


In [None]:
processed_w2v.select('description','cleanTokens.result').show(truncate=50)

+--------------------------------------------------+--------------------------------------------------+
|                                       description|                                            result|
+--------------------------------------------------+--------------------------------------------------+
| Short sellers, Wall Street's dwindling band of...|[Short, sellers, Wall, Streets, dwindling, band...|
| Private investment firm Carlyle Group, which h...|[Private, investment, firm, Carlyle, Group, rep...|
| Soaring crude prices plus worries about the ec...|[Soaring, crude, prices, plus, worries, economy...|
| Authorities have halted oil export flows from ...|[Authorities, halted, oil, export, flows, main,...|
| Tearaway world oil prices, toppling records an...|[Tearaway, world, oil, prices, toppling, record...|
| Stocks ended slightly higher on Friday but sta...|[Stocks, ended, slightly, higher, Friday, staye...|
| Assets of the nation's retail money market mut...|[Assets, nat

## LogReg with Spark NLP BERT Embeddings

In [None]:
document_assembler = DocumentAssembler() \
      .setInputCol("description") \
      .setOutputCol("document")
    
tokenizer = Tokenizer() \
      .setInputCols(["document"]) \
      .setOutputCol("token")
    
normalizer = Normalizer() \
      .setInputCols(["token"]) \
      .setOutputCol("normalized")

stopwords_cleaner = StopWordsCleaner()\
      .setInputCols(["normalized"])\
      .setOutputCol("cleanTokens")\
      .setCaseSensitive(False)

bert_embeddings = BertEmbeddings\
      .pretrained('bert_base_cased', 'en') \
      .setInputCols(["document",'cleanTokens'])\
      .setOutputCol("bert")\
      .setCaseSensitive(False)

embeddingsSentence = SentenceEmbeddings() \
      .setInputCols(["document", "bert"]) \
      .setOutputCol("sentence_embeddings") \
      .setPoolingStrategy("AVERAGE")
    
embeddings_finisher = EmbeddingsFinisher() \
      .setInputCols(["sentence_embeddings"]) \
      .setOutputCols(["finished_sentence_embeddings"]) \
      .setOutputAsVector(True)\
      .setCleanAnnotations(False)

label_stringIdx = StringIndexer(inputCol = "category", outputCol = "label")


nlp_pipeline_bert = Pipeline(
    stages=[document_assembler, 
            tokenizer,
            normalizer,
            stopwords_cleaner, 
            bert_embeddings,
            embeddingsSentence,
            embeddings_finisher,
            label_stringIdx])

nlp_model_bert = nlp_pipeline_bert.fit(newsDF)

processed_bert = nlp_model_bert.transform(newsDF)

processed_bert.count()


bert_base_cased download started this may take some time.
Approximate size to download 389.2 MB
[OK!]


120000

In [None]:
from pyspark.sql.functions import explode

processed_bert = processed_bert.withColumn("features", explode(processed_bert.finished_sentence_embeddings))

processed_bert.select('description','features','label').show()


+--------------------+--------------------+-----+
|         description|            features|label|
+--------------------+--------------------+-----+
|Srinagar, Nov 6 (...|[-0.0763546451926...|  2.0|
|France's presiden...|[0.01601043716073...|  2.0|
|President  Bush s...|[0.11258428543806...|  2.0|
|Established Shiit...|[0.09958435595035...|  2.0|
|While Democrats p...|[-0.3666543066501...|  2.0|
|Rural and deprive...|[0.08482994884252...|  1.0|
| Terrell Owens is...|[-0.1571628898382...|  3.0|
|" Gov. Ed Rendell...|[-0.0437468327581...|  3.0|
| A month after a ...|[-0.1684152632951...|  3.0|
| No Diana Taurasi...|[-0.0047841807827...|  3.0|
| An upbeat Presid...|[0.15349867939949...|  2.0|
| Gay and lesbian ...|[0.17594610154628...|  2.0|
| Twenty three peo...|[-0.0070635229349...|  2.0|
|  Connecticut Att...|[0.13604542613029...|  0.0|
|A new report on g...|[0.07444920390844...|  1.0|
|That Michael Siew...|[0.23243072628974...|  1.0|
|Vice chairman of ...|[-0.2215369194746...|  1.0|


In [None]:
# set seed for reproducibility
(trainingData, testData) = processed_bert.randomSplit([0.7, 0.3], seed = 100)
print("Training Dataset Count: " + str(trainingData.count()))
print("Test Dataset Count: " + str(testData.count()))

Training Dataset Count: 84045
Test Dataset Count: 35955
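
The counts above are close to 70/30 but not exact, because a random split assigns rows by sampling rather than by slicing. A minimal sketch of that idea in plain Python follows. The `random_split` helper is hypothetical and is not how Spark implements `randomSplit` internally; it only shows why the sizes are approximate and why a fixed seed makes the result repeatable.

```python
import random

def random_split(items, weights, seed):
    """Assign each item independently to a split, with probability proportional to weights."""
    rng = random.Random(seed)
    total = sum(weights)
    # Cumulative boundaries, e.g. [0.7, 1.0] for weights [0.7, 0.3]
    bounds = []
    acc = 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)
    splits = [[] for _ in weights]
    for item in items:
        u = rng.random()
        for i, b in enumerate(bounds):
            # The last split catches any rounding leftovers
            if u < b or i == len(bounds) - 1:
                splits[i].append(item)
                break
    return splits

train, test = random_split(range(10000), [0.7, 0.3], seed=100)
print(len(train), len(test))
```

Running the function twice with the same seed returns the same split, which is the purpose of `seed = 100` in the cell above.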


In [None]:
from pyspark.ml.classification import LogisticRegression

lr = LogisticRegression(maxIter=20, regParam=0.3, elasticNetParam=0)

lrModel = lr.fit(trainingData)
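
With `elasticNetParam=0` the regularization is a pure L2 (ridge) penalty scaled by `regParam`. The sketch below computes the elastic-net penalty term as described in the Spark ML documentation, `regParam * (alpha * L1 + (1 - alpha) / 2 * L2^2)`, for a made-up weight vector. The function name and the weights are assumptions for illustration only.

```python
def elastic_net_penalty(weights, reg_param, elastic_net_param):
    """Elastic-net penalty: reg_param * (alpha * ||w||_1 + (1 - alpha) / 2 * ||w||_2^2)."""
    l1 = sum(abs(w) for w in weights)
    l2_squared = sum(w * w for w in weights)
    alpha = elastic_net_param
    return reg_param * (alpha * l1 + (1.0 - alpha) * 0.5 * l2_squared)

weights = [0.5, -1.0, 2.0]  # hypothetical coefficients

ridge = elastic_net_penalty(weights, reg_param=0.3, elastic_net_param=0.0)  # setting used above
lasso = elastic_net_penalty(weights, reg_param=0.3, elastic_net_param=1.0)  # pure L1 for contrast
print(ridge, lasso)
```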


In [None]:
from pyspark.sql.functions import udf

@udf("long")
def num_nonzeros(v):
    return v.numNonzeros()

testData = testData.where(num_nonzeros("features") != 0)

In [None]:
predictions = lrModel.transform(testData)

predictions.select("description","category","probability","label","prediction") \
    .orderBy("probability", ascending=False) \
    .show(n = 10, truncate = 30)


+------------------------------+--------+------------------------------+-----+----------+
|                   description|category|                   probability|label|prediction|
+------------------------------+--------+------------------------------+-----+----------+
|The Securities and Exchange...|Business|[0.9967407593636138,0.00300...|  0.0|       0.0|
|Stocks opened higher today,...|Business|[0.9928207319563264,0.00469...|  0.0|       0.0|
| Retailer Payless ShoeSourc...|Business|[0.9926546087578139,0.00674...|  0.0|       0.0|
|The insurance brokerage rep...|Business|[0.9917833732987117,0.00754...|  0.0|       0.0|
|Shell outlined a profit str...|Business|[0.9916303454148256,0.00808...|  0.0|       0.0|
| Countrywide Financial Corp...|Business|[0.9916172364634749,0.00514...|  0.0|       0.0|
|PITTSBURGH Mellon Financial...|Business|[0.9915578428166462,0.00799...|  0.0|       0.0|
|  Grocery wholesaler Flemin...|Business|[0.9915445608575104,0.00766...|  0.0|       0.0|
|Mark Head
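
The `probability` column holds one value per class, and `prediction` is the index of the largest one. For a multinomial logistic regression these probabilities come from a softmax over the per-class raw scores. A minimal sketch with made-up scores (the numbers are not taken from the model above):

```python
import math

def softmax(scores):
    """Convert raw per-class scores into probabilities that sum to 1."""
    m = max(scores)                               # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

raw_scores = [4.0, 1.0, 0.5, -1.0]                # hypothetical scores for 4 classes
probability = softmax(raw_scores)
prediction = float(probability.index(max(probability)))  # index of the most likely class

print(probability, prediction)
```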

In [None]:
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
import pandas as pd

df = predictions.select('description','category','label','prediction').toPandas()

print(classification_report(df.label, df.prediction))
print(accuracy_score(df.label, df.prediction))

              precision    recall  f1-score   support

         0.0       0.82      0.79      0.80      8911
         1.0       0.81      0.80      0.81      8972
         2.0       0.84      0.86      0.85      9008
         3.0       0.90      0.94      0.92      9063

    accuracy                           0.85     35954
   macro avg       0.84      0.85      0.84     35954
weighted avg       0.84      0.85      0.85     35954

0.8459142237303221
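
The report above lists precision, recall, F1 and support per label. The sketch below recomputes those quantities by hand on a tiny made-up example, to show what each column means. The `per_class_metrics` helper is hypothetical; in the notebook, scikit-learn's `classification_report` does this work.

```python
def per_class_metrics(y_true, y_pred):
    """Compute precision, recall, F1 and support for every label."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    report = {}
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        report[label] = {"precision": precision, "recall": recall,
                         "f1": f1, "support": tp + fn}
    return report

# Made-up labels and predictions
y_true = [0.0, 0.0, 1.0, 1.0, 2.0, 2.0]
y_pred = [0.0, 1.0, 1.0, 1.0, 2.0, 0.0]

report = per_class_metrics(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

for label, scores in report.items():
    print(label, scores)
print(accuracy)
```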


## LogReg with ELMO Embeddings

In [None]:
document_assembler = DocumentAssembler() \
      .setInputCol("description") \
      .setOutputCol("document")
    
tokenizer = Tokenizer() \
      .setInputCols(["document"]) \
      .setOutputCol("token")
    
normalizer = Normalizer() \
      .setInputCols(["token"]) \
      .setOutputCol("normalized")

stopwords_cleaner = StopWordsCleaner()\
      .setInputCols("normalized")\
      .setOutputCol("cleanTokens")\
      .setCaseSensitive(False)

elmo_embeddings = ElmoEmbeddings.pretrained()\
      .setPoolingLayer("word_emb")\
      .setInputCols(["document",'cleanTokens'])\
      .setOutputCol("elmo")

embeddingsSentence = SentenceEmbeddings() \
      .setInputCols(["document", "elmo"]) \
      .setOutputCol("sentence_embeddings") \
      .setPoolingStrategy("AVERAGE")
    
embeddings_finisher = EmbeddingsFinisher() \
      .setInputCols(["sentence_embeddings"]) \
      .setOutputCols(["finished_sentence_embeddings"]) \
      .setOutputAsVector(True)\
      .setCleanAnnotations(False)

label_stringIdx = StringIndexer(inputCol = "category", outputCol = "label")


nlp_pipeline_elmo = Pipeline(
    stages=[document_assembler, 
            tokenizer,
            normalizer,
            stopwords_cleaner, 
            elmo_embeddings,
            embeddingsSentence,
            embeddings_finisher,
            label_stringIdx])

nlp_model_elmo = nlp_pipeline_elmo.fit(newsDF)

processed_elmo = nlp_model_elmo.transform(newsDF)

processed_elmo.count()


elmo download started this may take some time.
Approximate size to download 334.1 MB
[OK!]


120000
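
The `SentenceEmbeddings` stage above uses the `AVERAGE` pooling strategy, which collapses the token-level vectors of a document into a single vector. A minimal sketch of average pooling in plain Python, with made-up three-dimensional token vectors and a hypothetical `average_pool` helper:

```python
def average_pool(token_vectors):
    """Average a list of equal-length token vectors, dimension by dimension."""
    if not token_vectors:
        return []
    dim = len(token_vectors[0])
    count = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / count for i in range(dim)]

# Two hypothetical token embeddings for one document
token_vectors = [
    [1.0, 2.0, 3.0],
    [3.0, 4.0, 5.0],
]

sentence_vector = average_pool(token_vectors)
print(sentence_vector)
```

The resulting vector has the same dimension as the token vectors, regardless of how many tokens the document contains.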

In [None]:
(trainingData, testData) = newsDF.randomSplit([0.7, 0.3], seed = 100)

In [None]:
processed_trainingData = nlp_model_elmo.transform(trainingData)

processed_trainingData.count()

84038

In [None]:
processed_testData = nlp_model_elmo.transform(testData)

processed_testData.count()

35962

In [None]:
processed_trainingData.columns

['category',
 'description',
 'document',
 'token',
 'normalized',
 'cleanTokens',
 'elmo',
 'sentence_embeddings',
 'finished_sentence_embeddings',
 'label']

In [None]:
processed_testData= processed_testData.withColumn("features", explode(processed_testData.finished_sentence_embeddings))

processed_trainingData= processed_trainingData.withColumn("features", explode(processed_trainingData.finished_sentence_embeddings))


In [None]:
from pyspark.sql.functions import udf

@udf("long")
def num_nonzeros(v):
    return v.numNonzeros()

processed_testData = processed_testData.where(num_nonzeros("features") != 0)

In [None]:
%%time

from pyspark.ml.classification import LogisticRegression

lr = LogisticRegression(maxIter=20, regParam=0.3, elasticNetParam=0)

lrModel = lr.fit(processed_trainingData)


CPU times: user 14.5 s, sys: 1.12 s, total: 15.6 s
Wall time: 51min 41s


In [None]:
processed_trainingData.columns

['category',
 'description',
 'document',
 'token',
 'normalized',
 'cleanTokens',
 'elmo',
 'sentence_embeddings',
 'finished_sentence_embeddings',
 'label',
 'features']

In [None]:
predictions = lrModel.transform(processed_testData)

predictions.select("description","category","probability","label","prediction") \
    .orderBy("probability", ascending=False) \
    .show(n = 10, truncate = 30)


+------------------------------+--------+------------------------------+-----+----------+
|                   description|category|                   probability|label|prediction|
+------------------------------+--------+------------------------------+-----+----------+
|A forthcoming version of Wi...|Sci/Tech|[0.9993039099892396,4.50871...|  0.0|       0.0|
|The KDE Project has release...|Sci/Tech|[0.999246791111875,6.973488...|  0.0|       0.0|
|What do Internet Explorer, ...|Sci/Tech|[0.9992460656674004,6.07912...|  0.0|       0.0|
|Server: requires Microsoft ...|Sci/Tech|[0.9987961954234382,0.00112...|  0.0|       0.0|
|Plan promises downloads of ...|Sci/Tech|[0.9985562812084444,0.00117...|  0.0|       0.0|
|Mozilla Foundation, creator...|Sci/Tech|[0.9985218038953471,0.00135...|  0.0|       0.0|
|Microsoft introduced a beta...|Sci/Tech|[0.998473641569325,0.001387...|  0.0|       0.0|
| A software patch to protec...|Sci/Tech|[0.9983135752364677,0.00131...|  0.0|       0.0|
| Device u

In [None]:
df = predictions.select('description','category','label','prediction').toPandas()

In [None]:
df.shape

(35962, 4)

In [None]:
df.head()

Unnamed: 0,description,category,label,prediction
0,A $120 million fine levied on Royal Dutch/S...,Business,1.0,1.0
1,A Missouri woman is suing the maker of arthr...,Business,1.0,2.0
2,A Pennsylvania brewery is betting beer drink...,Business,1.0,1.0
3,A Secret Service ink expert was acquitted ye...,Business,1.0,2.0
4,A federal bankruptcy judge ruled against Uni...,Business,1.0,1.0


In [None]:
from sklearn.metrics import classification_report, accuracy_score

print(classification_report(df.label, df.prediction))
print(accuracy_score(df.label, df.prediction))

              precision    recall  f1-score   support

         0.0       0.83      0.82      0.82      8950
         1.0       0.83      0.82      0.82      8908
         2.0       0.88      0.87      0.88      9018
         3.0       0.94      0.96      0.95      9086

    accuracy                           0.87     35962
   macro avg       0.87      0.87      0.87     35962
weighted avg       0.87      0.87      0.87     35962

0.8685000834213893


## LogReg with Universal Sentence Encoder

In [None]:
useEmbeddings = UniversalSentenceEncoder.pretrained()\
      .setInputCols("document")\
      .setOutputCol("use_embeddings")

In [None]:
document_assembler = DocumentAssembler() \
      .setInputCol("description") \
      .setOutputCol("document")

loaded_useEmbeddings = UniversalSentenceEncoder.load('/root/cache_pretrained/tfhub_use_en_2.4.0_2.4_1587136330099')\
      .setInputCols("document")\
      .setOutputCol("use_embeddings")

embeddings_finisher = EmbeddingsFinisher() \
      .setInputCols(["use_embeddings"]) \
      .setOutputCols(["finished_use_embeddings"]) \
      .setOutputAsVector(True)\
      .setCleanAnnotations(False)

label_stringIdx = StringIndexer(inputCol = "category", outputCol = "label")

use_pipeline = Pipeline(
      stages=[
        document_assembler,
        loaded_useEmbeddings,
        embeddings_finisher,
        label_stringIdx]
      )

use_df = use_pipeline.fit(newsDF).transform(newsDF)

In [None]:
use_df.select('finished_use_embeddings').show(3)

+-----------------------+
|finished_use_embeddings|
+-----------------------+
|   [[0.0441501587629...|
|   [[0.0844451636075...|
|   [[0.0426647365093...|
+-----------------------+
only showing top 3 rows



In [None]:
from pyspark.sql.functions import explode

use_df= use_df.withColumn("features", explode(use_df.finished_use_embeddings))

In [None]:
use_df.show(2)

+--------+--------------------+--------------------+--------------------+-----------------------+-----+--------------------+
|category|         description|            document|      use_embeddings|finished_use_embeddings|label|            features|
+--------+--------------------+--------------------+--------------------+-----------------------+-----+--------------------+
|Business| Short sellers, W...|[[document, 0, 84...|[[sentence_embedd...|   [[0.0441501587629...|  1.0|[0.04415015876293...|
|Business| Private investme...|[[document, 0, 20...|[[sentence_embedd...|   [[0.0844451636075...|  1.0|[0.08444516360759...|
+--------+--------------------+--------------------+--------------------+-----------------------+-----+--------------------+
only showing top 2 rows



In [None]:
# set seed for reproducibility
(trainingData, testData) = use_df.randomSplit([0.7, 0.3], seed = 100)
print("Training Dataset Count: " + str(trainingData.count()))
print("Test Dataset Count: " + str(testData.count()))

Training Dataset Count: 84038
Test Dataset Count: 35962


In [None]:
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
import pandas as pd

from pyspark.ml.classification import LogisticRegression

lr = LogisticRegression(maxIter=20, regParam=0.3, elasticNetParam=0)

lrModel = lr.fit(trainingData)

predictions = lrModel.transform(testData)

predictions.filter(predictions['prediction'] == 0) \
    .select("description","category","probability","label","prediction") \
    .orderBy("probability", ascending=False) \
    .show(n = 10, truncate = 30)


+------------------------------+--------+------------------------------+-----+----------+
|                   description|category|                   probability|label|prediction|
+------------------------------+--------+------------------------------+-----+----------+
|Desktop Search from AOL, Go...|Sci/Tech|[0.9960102774312224,0.00273...|  0.0|       0.0|
|Humans will upgrade their n...|Sci/Tech|[0.9958870851755632,0.00153...|  0.0|       0.0|
| Downloading games, ring to...|Sci/Tech|[0.9953640108788133,0.00126...|  0.0|       0.0|
|Google brought the simplici...|Sci/Tech|[0.9947117246322481,0.00283...|  0.0|       0.0|
|The Mountain View company u...|Business|[0.9946224759216548,0.00335...|  1.0|       0.0|
|Free application creates to...|Sci/Tech|[0.9944398317988099,0.00138...|  0.0|       0.0|
|Users using Symbian operati...|Sci/Tech|[0.9942926689881441,0.00258...|  0.0|       0.0|
|Internet portal Lycos has d...|Sci/Tech|[0.9940289940853572,0.00450...|  0.0|       0.0|
|OQO, a ti

In [None]:
df = predictions.select('description','category','label','prediction').toPandas()


In [None]:
df.head()

Unnamed: 0,description,category,label,prediction
0,A $120 million fine levied on Royal Dutch/S...,Business,1.0,1.0
1,A Missouri woman is suing the maker of arthr...,Business,1.0,1.0
2,A Pennsylvania brewery is betting beer drink...,Business,1.0,1.0
3,A Secret Service ink expert was acquitted ye...,Business,1.0,2.0
4,A federal bankruptcy judge ruled against Uni...,Business,1.0,1.0


In [None]:

print(classification_report(df.label, df.prediction))
print(accuracy_score(df.label, df.prediction))

              precision    recall  f1-score   support

         0.0       0.84      0.84      0.84      8950
         1.0       0.83      0.83      0.83      8908
         2.0       0.90      0.88      0.89      9018
         3.0       0.95      0.97      0.96      9086

    accuracy                           0.88     35962
   macro avg       0.88      0.88      0.88     35962
weighted avg       0.88      0.88      0.88     35962

0.8825704910739114


### Train on the entire dataset

In [None]:
lr = LogisticRegression(maxIter=20, regParam=0.3, elasticNetParam=0)

lrModel = lr.fit(use_df)

In [None]:
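# the test set was downloaded as CSV above; use spark.read.parquet(...) if it is stored as parquet
test_df = spark.read.option("header", True).csv("news_category_test.csv")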

In [None]:
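# note: fitting on test_df re-fits the StringIndexer, so its label indices may differ from training
test_df = use_pipeline.fit(test_df).transform(test_df)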

In [None]:
test_df= test_df.withColumn("features", explode(test_df.finished_use_embeddings))

In [None]:
test_df.show(2)

+--------+--------------------+--------------------+--------------------+-----------------------+-----+--------------------+
|category|         description|            document|      use_embeddings|finished_use_embeddings|label|            features|
+--------+--------------------+--------------------+--------------------+-----------------------+-----+--------------------+
|Business|Unions representi...|[[document, 0, 12...|[[sentence_embedd...|   [[0.0129975397139...|  1.0|[0.01299753971397...|
|Sci/Tech| TORONTO, Canada ...|[[document, 0, 22...|[[sentence_embedd...|   [[0.0019999044016...|  0.0|[0.00199990440160...|
+--------+--------------------+--------------------+--------------------+-----------------------+-----+--------------------+
only showing top 2 rows



In [None]:
predictions = lrModel.transform(test_df)

In [None]:
df = predictions.select('description','category','label','prediction').toPandas()

In [None]:
# derive labels from the category names instead of relying on the re-fitted StringIndexer
df['label'] = df.category.replace({'World':2.0,
                    'Sports':3.0,
                    'Business':0.0,
                    'Sci/Tech':1.0})
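
A label indexer that orders categories by frequency (the default behaviour of `StringIndexer`) can assign different indices when it is fitted on different data, which is why labels are mapped explicitly by category name here. The sketch below imitates frequency-based indexing in plain Python. The `fit_label_index` helper, the alphabetical tie-breaking and the sample labels are assumptions for illustration only.

```python
from collections import Counter

def fit_label_index(labels):
    """Map each label to an index: most frequent first, ties broken alphabetically (assumed)."""
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda lab: (-counts[lab], lab))
    return {lab: float(i) for i, lab in enumerate(ordered)}

# Made-up label distributions for two different datasets
train_labels = ["Sci/Tech"] * 3 + ["Business"] * 2 + ["World"]
test_labels = ["Business"] * 3 + ["Sci/Tech"] * 2 + ["World"]

train_map = fit_label_index(train_labels)
test_map = fit_label_index(test_labels)

print(train_map)
print(test_map)
```

Because the two mappings disagree on `Business` and `Sci/Tech`, predictions from a model trained with one mapping cannot be scored against labels produced by the other.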

In [None]:
df.head()

Unnamed: 0,description,category,label,prediction
0,Unions representing workers at Turner Newall...,Business,0.0,0.0
1,"TORONTO, Canada A second team of rocketeer...",Sci/Tech,1.0,1.0
2,A company founded by a chemistry researcher a...,Sci/Tech,1.0,1.0
3,It's barely dawn when Mike Fitzpatrick starts...,Sci/Tech,1.0,1.0
4,Southern California's smog fighting agency we...,Sci/Tech,1.0,0.0


In [None]:
print(classification_report(df.label, df.prediction))
print(accuracy_score(df.label, df.prediction))

              precision    recall  f1-score   support

         0.0       0.83      0.83      0.83      1900
         1.0       0.84      0.85      0.85      1900
         2.0       0.90      0.87      0.89      1900
         3.0       0.95      0.97      0.96      1900

    accuracy                           0.88      7600
   macro avg       0.88      0.88      0.88      7600
weighted avg       0.88      0.88      0.88      7600

0.8798684210526316


## Spark NLP Licensed DocClassifier

In [None]:
from sparknlp_jsl.annotator import *

In [None]:
# set seed for reproducibility
(trainingData, testData) = newsDF.randomSplit([0.7, 0.3], seed = 100)
print("Training Dataset Count: " + str(trainingData.count()))
print("Test Dataset Count: " + str(testData.count()))

Training Dataset Count: 84076
Test Dataset Count: 35924


In [None]:
document_assembler = DocumentAssembler() \
      .setInputCol("description") \
      .setOutputCol("document")
    
tokenizer = Tokenizer() \
      .setInputCols(["document"]) \
      .setOutputCol("token")
    
normalizer = Normalizer() \
      .setInputCols(["token"]) \
      .setOutputCol("normalized")

stopwords_cleaner = StopWordsCleaner()\
      .setInputCols("normalized")\
      .setOutputCol("cleanTokens")\
      .setCaseSensitive(False)

stemmer = Stemmer() \
      .setInputCols(["cleanTokens"]) \
      .setOutputCol("stem")

logreg = DocumentLogRegClassifierApproach()\
      .setInputCols(["stem"])\
      .setLabelCol("category")\
      .setOutputCol("prediction")

nlp_pipeline = Pipeline(
    stages=[document_assembler, 
            tokenizer,
            normalizer,
            stopwords_cleaner, 
            stemmer, 
            logreg])

nlp_model = nlp_pipeline.fit(trainingData)

processed = nlp_model.transform(testData)

processed.count()

35923

In [None]:
processed.select('description','category','prediction.result').show(truncate=50)

+--------------------------------------------------+--------+----------+
|                                       description|category|    result|
+--------------------------------------------------+--------+----------+
|  In a city where terror attacks and a massive ...|Business|[Business]|
|  It sure isn #39;t the Goldilocks Economy of y...|Business|[Business]|
|, 8/30/2004. With 90 nanometer chips now on the...|Business|[Sci/Tech]|
|National Grid Transco, the Britain-based delive...|Business|[Business]|
| quot;A person who has been cheated is left in ...|Sci/Tech|[Sci/Tech]|
|" In its ongoing war with SCO over Linux and Un...|Sci/Tech|[Sci/Tech]|
|A bacteria-eating virus is the star of a new vi...|Sci/Tech|[Sci/Tech]|
|Birdman of Belair Mathew Tekulsky waxes on the ...|Sci/Tech|[Sci/Tech]|
|Computer maker sees to recover \$8.6 million in...|Sci/Tech|[Sci/Tech]|
|Hurricane Frances spared NASA #39;s depleted sh...|Sci/Tech|[Sci/Tech]|
|In a study, the now-public search engine out-ra...

In [None]:
processed.select('description','prediction.result').show(truncate=50)

+--------------------------------------------------+----------+
|                                       description|    result|
+--------------------------------------------------+----------+
|  In a city where terror attacks and a massive ...|[Business]|
|  It sure isn #39;t the Goldilocks Economy of y...|[Business]|
|, 8/30/2004. With 90 nanometer chips now on the...|[Sci/Tech]|
|National Grid Transco, the Britain-based delive...|[Business]|
| quot;A person who has been cheated is left in ...|[Sci/Tech]|
|" In its ongoing war with SCO over Linux and Un...|[Sci/Tech]|
|A bacteria-eating virus is the star of a new vi...|[Sci/Tech]|
|Birdman of Belair Mathew Tekulsky waxes on the ...|[Sci/Tech]|
|Computer maker sees to recover \$8.6 million in...|[Sci/Tech]|
|Hurricane Frances spared NASA #39;s depleted sh...|[Sci/Tech]|
|In a study, the now-public search engine out-ra...|[Sci/Tech]|
|New York, August 31: US technology executives a...|[Sci/Tech]|
|Ordinary mice can be turned into marath

In [None]:
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
import pandas as pd

In [None]:
df = processed.select('description','category','prediction.result').toPandas()

In [None]:
df.head()

Unnamed: 0,description,category,result
0,In a city where terror attacks and a massive...,Business,[Business]
1,It sure isn #39;t the Goldilocks Economy of ...,Business,[Business]
2,", 8/30/2004. With 90 nanometer chips now on th...",Business,[Sci/Tech]
3,"National Grid Transco, the Britain-based deliv...",Business,[Business]
4,quot;A person who has been cheated is left in...,Sci/Tech,[Sci/Tech]


In [None]:
df.result[0][0]

'Business'

In [None]:
df = processed.select('description','category','prediction.result').toPandas()
df['result'] = df['result'].apply(lambda x: x[0])

In [None]:
df.head()

Unnamed: 0,description,category,result
0,In a city where terror attacks and a massive...,Business,Business
1,It sure isn #39;t the Goldilocks Economy of ...,Business,Business
2,", 8/30/2004. With 90 nanometer chips now on th...",Business,Sci/Tech
3,"National Grid Transco, the Britain-based deliv...",Business,Business
4,quot;A person who has been cheated is left in...,Sci/Tech,Sci/Tech


In [None]:

df = processed.select('description','category','prediction.result').toPandas()
df['result'] = df['result'].apply(lambda x: x[0])

print(classification_report(df.category, df.result))
print(accuracy_score(df.category, df.result))

              precision    recall  f1-score   support

    Business       0.82      0.82      0.82      8915
    Sci/Tech       0.83      0.83      0.83      9018
      Sports       0.94      0.93      0.93      9002
       World       0.86      0.86      0.86      8988

    accuracy                           0.86     35923
   macro avg       0.86      0.86      0.86     35923
weighted avg       0.86      0.86      0.86     35923

0.8612588035520419
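
`confusion_matrix` is imported above but never displayed. As a complement to the report, a confusion matrix shows which categories get mistaken for which. The sketch below builds one by hand from made-up string labels; the `confusion_rows` helper is hypothetical and stands in for scikit-learn's `confusion_matrix`.

```python
from collections import Counter

def confusion_rows(y_true, y_pred):
    """Return (labels, matrix) where matrix[i][j] counts true labels[i] predicted as labels[j]."""
    labels = sorted(set(y_true) | set(y_pred))
    counts = Counter(zip(y_true, y_pred))
    matrix = [[counts[(t, p)] for p in labels] for t in labels]
    return labels, matrix

# Made-up categories and predictions
y_true = ["Business", "Business", "Sci/Tech", "Sports", "World", "World"]
y_pred = ["Business", "Sci/Tech", "Sci/Tech", "Sports", "World", "Business"]

labels, matrix = confusion_rows(y_true, y_pred)
for label, row in zip(labels, matrix):
    print(label, row)
```

Rows are true categories and columns are predicted categories, so off-diagonal entries are the misclassifications.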


# ClassifierDL

In [None]:
# actual content is inside description column
document = DocumentAssembler()\
    .setInputCol("description")\
    .setOutputCol("document")

use = UniversalSentenceEncoder.load('/root/cache_pretrained/tfhub_use_en_2.4.4_2.4_1583158595769')\
    .setInputCols(["document"])\
    .setOutputCol("sentence_embeddings")

# the classes/labels/categories are in category column
classifierdl = ClassifierDLApproach()\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("class")\
    .setLabelColumn("category")\
    .setMaxEpochs(5)\
    .setEnableOutputLogs(True)

pipeline = Pipeline(
    stages = [
        document,
        use,
        classifierdl
    ])

In [None]:
# set seed for reproducibility
(trainingData, testData) = newsDF.randomSplit([0.7, 0.3], seed = 100)
print("Training Dataset Count: " + str(trainingData.count()))
print("Test Dataset Count: " + str(testData.count()))

Training Dataset Count: 84045
Test Dataset Count: 35955


In [None]:
pipelineModel = pipeline.fit(trainingData)

In [None]:
from sklearn.metrics import classification_report, accuracy_score

df = pipelineModel.transform(testData).select('category','description',"class.result").toPandas()

df['result'] = df['result'].apply(lambda x: x[0])

print(classification_report(df.category, df.result))
print(accuracy_score(df.category, df.result))

              precision    recall  f1-score   support

    Business       0.85      0.84      0.85      8911
    Sci/Tech       0.85      0.87      0.86      8973
      Sports       0.95      0.98      0.97      9063
       World       0.92      0.88      0.90      9008

    accuracy                           0.89     35955
   macro avg       0.89      0.89      0.89     35955
weighted avg       0.89      0.89      0.89     35955

0.8930329578639966


## Loading the trained classifier from disk

In [None]:
classifierdlmodel = ClassifierDLModel.load('classifierDL_model_20200317_5e')
 

In [None]:
import sparknlp
sparknlp.__path__

In [None]:
trainDataset = spark.read \
      .option("header", True) \
      .csv("data/news_category_train.csv")

In [None]:
trainDataset.count()

120000

In [None]:
trainingData.count()

84045

In [None]:
document = DocumentAssembler()\
    .setInputCol("description")\
    .setOutputCol("document")


sentence = SentenceDetector()\
    .setInputCols(['document'])\
    .setOutputCol('sentence')

use = UniversalSentenceEncoder.load('/root/cache_pretrained/tfhub_use_en_2.4.4_2.4_1583158595769')\
    .setInputCols(["sentence"])\
    .setOutputCol("sentence_embeddings")

classifierdlmodel = ClassifierDLModel.load('classifierDL_model_20200317_5e')

pipeline = Pipeline(
    stages = [
        document,
        sentence,
        use,
        classifierdlmodel
    ])

In [None]:
pipeline.fit(testData.limit(1)).transform(testData.limit(10)).select('category','description',"class.result").show(10, truncate=50)

+--------+--------------------------------------------------+----------+
|category|                                       description|    result|
+--------+--------------------------------------------------+----------+
|Business|  A federal judge on Monday stayed his own ruli...|[Business]|
|Business|  A half dozen executives of Yukos, the embattl...|[Business]|
|Business|  A labor dispute may sideline professional hoc...|[Business]|
|Business|  A ruling from the World Trade Organization co...|[Business]|
|Business|  American Airlines has unveiled a new simplifi...|[Business]|
|Business|  Anglo Aussie miner BHP Billiton (BHP) (UK:BLT...|[Business]|
|Business|  Another group of investors hit beleaguered mo...|[Business]|
|Business|  At a sponsors' meeting of MIT Sloan School's ...|[Business]|
|Business|  Blockbuster Inc. wants to acquire rival Holly...|[Business]|
|Business|  Bolstered by investors, Oracle Corp. appears ...|[Business]|
+--------+-----------------------------------------

In [None]:
lm = LightPipeline(pipeline.fit(testData.limit(1)))
lm.annotate('In its first two years, the UK dedicated card companies have surge')

{'document': ['In its first two years, the UK dedicated card companies have surge'],
 'sentence_embeddings': ['In its first two years, the UK dedicated card companies have surge'],
 'class': ['Sci/Tech']}

In [None]:
text='''
Fearing the fate of Italy, the centre-right government has threatened to be merciless with those who flout tough restrictions. As of Wednesday it will also include all shops being closed across Greece, with the exception of supermarkets. Banks, pharmacies, pet-stores, mobile phone stores, opticians, bakers, mini-markets, couriers and food delivery outlets are among the few that will also be allowed to remain open.
'''

In [None]:
lm = LightPipeline(pipeline.fit(testData.limit(1)))

lm.annotate(text)

{'document': ['\nFearing the fate of Italy, the centre-right government has threatened to be merciless with those who flout tough restrictions. As of Wednesday it will also include all shops being closed across Greece, with the exception of supermarkets. Banks, pharmacies, pet-stores, mobile phone stores, opticians, bakers, mini-markets, couriers and food delivery outlets are among the few that will also be allowed to remain open.\n'],
 'sentence': ['Fearing the fate of Italy, the centre-right government has threatened to be merciless with those who flout tough restrictions.',
  'As of Wednesday it will also include all shops being closed across Greece, with the exception of supermarkets.',
  'Banks, pharmacies, pet-stores, mobile phone stores, opticians, bakers, mini-markets, couriers and food delivery outlets are among the few that will also be allowed to remain open.'],
 'sentence_embeddings': ['Fearing the fate of Italy, the centre-right government has threatened to be merciless wi

# ClassifierDL + GloVe + Basic text processing

In [None]:
tokenizer = Tokenizer() \
      .setInputCols(["document"]) \
      .setOutputCol("token")

lemma = LemmatizerModel.pretrained('lemma_antbnc') \
      .setInputCols(["token"]) \
      .setOutputCol("lemma")

# defined here so that this cell runs on its own
glove_embeddings = WordEmbeddingsModel.pretrained() \
      .setInputCols(["document", "lemma"]) \
      .setOutputCol("embeddings")

lemma_pipeline = Pipeline(
    stages=[document_assembler, 
            tokenizer,
            lemma,
            glove_embeddings])

lemma_antbnc download started this may take some time.
Approximate size to download 907.6 KB
[OK!]


In [None]:
lemma_pipeline.fit(trainingData.limit(1000)).transform(trainingData.limit(1000)).show(truncate=30)

+--------+------------------------------+------------------------------+------------------------------+------------------------------+------------------------------+
|category|                   description|                      document|                         token|                         lemma|                    embeddings|
+--------+------------------------------+------------------------------+------------------------------+------------------------------+------------------------------+
|Business|  #39;Tis the season to buy...|[[document, 0, 141,   #39;T...|[[token, 2, 8, #39;Tis, [se...|[[token, 2, 8, #39;Tis, [se...|[[word_embeddings, 2, 8, #3...|
|Business|  A Delaware judge rejected...|[[document, 0, 161,   A Del...|[[token, 2, 2, A, [sentence...|[[token, 2, 2, A, [sentence...|[[word_embeddings, 2, 2, A,...|
|Business|  A Food and Drug Administr...|[[document, 0, 140,   A Foo...|[[token, 2, 2, A, [sentence...|[[token, 2, 2, A, [sentence...|[[word_embeddings, 2, 2, A,...|
|Bus

In [None]:
document_assembler = DocumentAssembler() \
      .setInputCol("description") \
      .setOutputCol("document")
    
tokenizer = Tokenizer() \
      .setInputCols(["document"]) \
      .setOutputCol("token")
    
normalizer = Normalizer() \
      .setInputCols(["token"]) \
      .setOutputCol("normalized")

stopwords_cleaner = StopWordsCleaner()\
      .setInputCols("normalized")\
      .setOutputCol("cleanTokens")\
      .setCaseSensitive(False)

lemma = LemmatizerModel.pretrained('lemma_antbnc') \
      .setInputCols(["cleanTokens"]) \
      .setOutputCol("lemma")

glove_embeddings = WordEmbeddingsModel.pretrained() \
      .setInputCols(["document",'lemma'])\
      .setOutputCol("embeddings")\
      .setCaseSensitive(False)

embeddingsSentence = SentenceEmbeddings() \
      .setInputCols(["document", "embeddings"]) \
      .setOutputCol("sentence_embeddings") \
      .setPoolingStrategy("AVERAGE")

classifierdl = ClassifierDLApproach()\
      .setInputCols(["sentence_embeddings"])\
      .setOutputCol("class")\
      .setLabelColumn("category")\
      .setMaxEpochs(10)\
      .setEnableOutputLogs(True)

clf_pipeline = Pipeline(
    stages=[document_assembler, 
            tokenizer,
            normalizer,
            stopwords_cleaner, 
            lemma, 
            glove_embeddings,
            embeddingsSentence,
            classifierdl])

lemma_antbnc download started this may take some time.
Approximate size to download 907.6 KB
[OK!]
glove_100d download started this may take some time.
Approximate size to download 145.3 MB
[OK!]


In [None]:
!rm -rf classifier_dl_pipeline_glove

In [None]:
# fit the pipeline first so that there is a model to save
clf_pipelineModel = clf_pipeline.fit(trainingData)

In [None]:
clf_pipelineModel.save('classifier_dl_pipeline_glove')

In [None]:
df = clf_pipelineModel.transform(testData).select('category','description',"class.result").toPandas()

df['result'] = df['result'].apply(lambda x: x[0])

print(classification_report(df.category, df.result))

print(accuracy_score(df.category, df.result))

              precision    recall  f1-score   support

    Business       0.85      0.82      0.83      8911
    Sci/Tech       0.81      0.89      0.85      8973
      Sports       0.95      0.97      0.96      9063
       World       0.92      0.86      0.89      9008

    accuracy                           0.88     35955
   macro avg       0.88      0.88      0.88     35955
weighted avg       0.88      0.88      0.88     35955

0.8809066889167014


In [None]:
!cd data && ls -l

In [None]:
import pandas as pd

In [None]:
news_df = newsDF.toPandas()

In [None]:
news_df.head()

Unnamed: 0,category,description
0,World,"Srinagar, Nov 6 (UNI) Two militants and a Bord..."
1,World,France's president orders his forces to destro...
2,World,President Bush says he will reach out to alli...
3,World,Established Shiite parties and powerful upstar...
4,World,While Democrats placed their emphasis on the s...


In [None]:
news_df.to_csv('data/news_dataset.csv', index=False)

In [None]:
document_assembler = DocumentAssembler() \
      .setInputCol("description") \
      .setOutputCol("document")
      
tokenizer = Tokenizer() \
      .setInputCols(["document"]) \
      .setOutputCol("token")
      
normalizer = Normalizer() \
      .setInputCols(["token"]) \
      .setOutputCol("normalized")

stopwords_cleaner = StopWordsCleaner()\
      .setInputCols("normalized")\
      .setOutputCol("cleanTokens")\
      .setCaseSensitive(False)

lemma = LemmatizerModel.pretrained('lemma_antbnc') \
      .setInputCols(["cleanTokens"]) \
      .setOutputCol("lemma")

glove_embeddings = WordEmbeddingsModel.pretrained() \
      .setInputCols(["document",'lemma'])\
      .setOutputCol("embeddings")\
      .setCaseSensitive(False)

txt_pipeline = Pipeline(
    stages=[document_assembler, 
            tokenizer,
            normalizer,
            stopwords_cleaner, 
            lemma, 
            glove_embeddings,
            embeddingsSentence])

lemma_antbnc download started this may take some time.
Approximate size to download 907.6 KB
[OK!]
glove_100d download started this may take some time.
Approximate size to download 145.3 MB
[OK!]


In [None]:
txt_pipelineModel = txt_pipeline.fit(testData.limit(1))

In [None]:
txt_pipelineModel.save('text_prep_pipeline_glove')

In [None]:
df.head()

Unnamed: 0,category,description,result
0,Business,A federal judge on Monday stayed his own rul...,Business
1,Business,"A half dozen executives of Yukos, the embatt...",Business
2,Business,A labor dispute may sideline professional ho...,Sports
3,Business,A ruling from the World Trade Organization c...,Sci/Tech
4,Business,American Airlines has unveiled a new simplif...,Business
