# Ex3: NLP - Tags 

### Requirement: Build a tag filter. Use NLP tools and a classifier to predict the tag for a question. In the future, questions could be auto-tagged by such a classifier, or tags could be recommended to users before posting.
- Dataset: stack-overflow-data.csv. It contains Stack Overflow questions and associated tags.
- Reference: http://benalexkeen.com/multiclass-text-classification-with-pyspark/

In [1]:
import findspark
findspark.init()

In [2]:
import pyspark

In [3]:
from pyspark.sql import SparkSession
from pyspark import SparkContext
from pyspark import SparkConf

In [4]:
# SparkContext.setSystemProperty('spark.executor.memory', '12g')
# sc = SparkContext(master='spark://172.25.51.55:7077', appName='Stack_Overflow')
# spark = SparkSession(sc)
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("Big_data_test")
    .config("spark.memory.fraction", 0.8)
    .config("spark.executor.memory", "10g")
    .config("spark.driver.memory", "10g")
    .config("spark.sql.shuffle.partitions", "800")
    .config("spark.memory.offHeap.enabled", "true")
    .config("spark.memory.offHeap.size", "10g")
    .getOrCreate()
)

In [5]:
spark

In [6]:
file_name = "stack_overflow_data.csv"
# file_name = "stack-overflow-data.csv"

In [7]:
data = spark.read.csv(file_name, inferSchema=True,header=True)

In [8]:
data.show(5)

+--------------------+-----------+
|                post|       tags|
+--------------------+-----------+
|what is causing t...|         c#|
|have dynamic html...|    asp.net|
|how to convert a ...|objective-c|
|.net framework 4 ...|       .net|
|trying to calcula...|     python|
+--------------------+-----------+
only showing top 5 rows



In [9]:
data.groupby('tags').count().show(30)

+-------------+-----+
|         tags|count|
+-------------+-----+
|         null|20798|
|            c| 2000|
|       python| 2000|
|ruby-on-rails| 2000|
|       iphone| 2000|
|           c#| 2000|
|       jquery| 2000|
|   javascript| 2000|
|          php| 2000|
|          ios| 2000|
|      android| 2000|
|      asp.net| 2000|
|         html| 2000|
|        mysql| 2000|
|          css| 2000|
|          sql| 2000|
|  objective-c| 2000|
|         .net| 2000|
|          c++| 2000|
|         java| 2000|
|    angularjs| 2000|
+-------------+-----+



In [10]:
tags_null_data = data.filter(data.tags.isNull())

In [11]:
tags_null_data.count()

20798

In [12]:
data = data.filter(data.tags.isNotNull())

In [13]:
data.count()

40000
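The 20,798 rows with a null tag deserve a second look before they are discarded. One possible explanation (an assumption; the notebook does not confirm it) is that some posts contain line breaks inside quoted fields, so a line-oriented reader splits one record into several rows and only the last fragment keeps its tag. The sketch below uses only Python's standard `csv` module and a made-up two-record sample to show the effect.

```python
import csv
import io

# Made-up sample: the second record has a line break inside its quoted post.
raw = 'post,tags\n"how to loop",python\n"first line\nsecond line",java\n'

# Line-oriented view: every physical line is treated as one row.
physical_rows = raw.strip().split("\n")[1:]

# Record-oriented view: csv.reader honours quotes that span several lines.
logical_rows = list(csv.reader(io.StringIO(raw, newline="")))[1:]

print(len(physical_rows))  # number of fragments seen by a line-oriented reader
print(len(logical_rows))   # number of real records
print(logical_rows[1])
```

If this is what happens with the real file, reading it with `spark.read.csv(file_name, header=True, multiLine=True, escape='"')` is the usual remedy; whether it applies here would need to be checked against the data.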

In [14]:
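# Avoid `import *` here: it shadows Python built-ins such as sum, min, and max.
# The functions used below (length, udf) are imported explicitly where needed.
from pyspark.sql import functions as F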

## Clean and Prepare the Data

**Create a new length feature:**

In [15]:
from pyspark.sql.functions import length

In [16]:
data = data.withColumn('length',length(data['post']))

In [17]:
data.show()

+--------------------+-------------+------+
|                post|         tags|length|
+--------------------+-------------+------+
|what is causing t...|           c#|   833|
|have dynamic html...|      asp.net|   804|
|how to convert a ...|  objective-c|   755|
|.net framework 4 ...|         .net|   349|
|trying to calcula...|       python|  1290|
|how to give alias...|      asp.net|   309|
|window.open() ret...|    angularjs|   495|
|identifying serve...|       iphone|   424|
|unknown method ke...|ruby-on-rails|  2022|
|from the include ...|    angularjs|  1279|
|when we need inte...|           c#|   995|
|how to install .i...|          ios|   344|
|dynamic textbox t...|      asp.net|   389|
|rather than bubbl...|            c|  1338|
|site deployed in ...|      asp.net|   349|
|connection in .ne...|         .net|   228|
|how to subtract 1...|  objective-c|    62|
|ror console show ...|ruby-on-rails|  2594|
|distance between ...|       iphone|   336|
|sql query - how t...|          

In [18]:
# Average post length differs clearly between tags
data.groupby('tags').mean().show()

+-------------+-----------+
|         tags|avg(length)|
+-------------+-----------+
|            c|  1121.1115|
|       python|  1018.6695|
|ruby-on-rails|  1244.2055|
|       iphone|    709.621|
|           c#|  1145.3065|
|       jquery|   1081.507|
|   javascript|    964.396|
|          php|  1123.4205|
|          ios|   970.7565|
|      android|  1713.4345|
|      asp.net|     999.95|
|         html|   891.3105|
|        mysql|   1038.561|
|          css|    954.809|
|          sql|    870.912|
|  objective-c|   972.8925|
|         .net|   731.0075|
|          c++|   1295.955|
|         java|   1357.308|
|    angularjs|  1294.7545|
+-------------+-----------+



## Feature Transformations

In [19]:
from bs4 import BeautifulSoup

from pyspark import keyword_only
from pyspark.ml import Transformer
from pyspark.ml.param.shared import HasInputCol, HasOutputCol
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

In [20]:
class BsTextExtractor(Transformer, HasInputCol, HasOutputCol):

    @keyword_only
    def __init__(self, inputCol=None, outputCol=None):
        super(BsTextExtractor, self).__init__()
        kwargs = self._input_kwargs
        self.setParams(**kwargs)

    @keyword_only
    def setParams(self, inputCol=None, outputCol=None):
        kwargs = self._input_kwargs
        return self._set(**kwargs)

    def _transform(self, dataset):

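        def f(s):
            # Posts can be null; pass them through instead of failing inside the UDF.
            if s is None:
                return None
            # Name the parser explicitly to avoid bs4's "no parser specified" warning.
            return BeautifulSoup(s, "html.parser").text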

        t = StringType()
        out_col = self.getOutputCol()
        in_col = dataset[self.getInputCol()]
        return dataset.withColumn(out_col, udf(f, t)(in_col))
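To see what the UDF inside `_transform` does to a single post without starting Spark, the following sketch strips markup with the standard library's `html.parser`. It is a dependency-free stand-in for `BeautifulSoup(s).text`, not the same parser, so edge cases may differ. The names `_TextCollector` and `strip_html` are hypothetical.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(s):
    # Mirror the null-safe behaviour one would want in a Spark UDF.
    if s is None:
        return None
    parser = _TextCollector()
    parser.feed(s)
    parser.close()  # flush any text still buffered after the last tag
    return "".join(parser.parts)


print(strip_html("<p>how to <code>print</code> in python</p>"))
```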

In [21]:
from pyspark.ml.feature import Tokenizer, StopWordsRemover, CountVectorizer, IDF, StringIndexer

# HTML -> plain text -> tokens -> tokens without stop words -> term counts -> TF-IDF
text_extractor = BsTextExtractor(inputCol="post", outputCol="cleaned_post")
tokenizer = Tokenizer(inputCol="cleaned_post", outputCol="token_text")
stopremove = StopWordsRemover(inputCol="token_text", outputCol="stop_tokens")
count_vec = CountVectorizer(inputCol="stop_tokens", outputCol="c_vec")
idf = IDF(inputCol="c_vec", outputCol="tf_idf")
# Encode the string tag as a numeric label
class_to_num = StringIndexer(inputCol="tags", outputCol="label")
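The tokenizer, stop-word remover, count vectorizer, and IDF stages together produce a TF-IDF weight per term. The pure-Python sketch below walks through the same steps on three made-up posts. It uses the smoothed formula documented for Spark's `IDF`, `log((m + 1) / (df + 1))`, where `m` is the number of documents and `df` the number of documents containing the term. The stop-word list is a tiny illustrative one, not Spark's default, and tokenization is only approximately what `Tokenizer` does.

```python
import math
from collections import Counter

STOP_WORDS = {"how", "to", "a", "in", "the"}  # illustrative, not Spark's list


def tokenize(text):
    # Roughly like Tokenizer: lowercase, then split on whitespace.
    return text.lower().split()


def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]


def tf_idf(docs):
    token_lists = [remove_stop_words(tokenize(d)) for d in docs]
    m = len(token_lists)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in token_lists for term in set(tokens))
    idf = {term: math.log((m + 1) / (n + 1)) for term, n in df.items()}
    # Term frequency (raw count) times inverse document frequency.
    return [
        {term: count * idf[term] for term, count in Counter(tokens).items()}
        for tokens in token_lists
    ]


docs = [
    "How to sort a list in python",
    "How to sort a list in java",
    "python python decorators",
]
weights = tf_idf(docs)
for w in weights:
    print(w)
```

Terms that occur in many documents (here `sort`, `list`, `python`) get a smaller IDF than terms that occur in only one (`java`, `decorators`).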

In [22]:
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.linalg import Vector

In [23]:
clean_up = VectorAssembler(inputCols=['tf_idf','length'],outputCol='features')
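The assembler appends the scalar `length` column to the TF-IDF vector. That is why the feature vectors printed later have size 262145: `CountVectorizer` caps the vocabulary at its default `vocabSize` of 2^18 = 262144 (assuming the vocabulary reached that cap here), and `length` takes one more slot. A minimal sketch with a sparse vector modelled as a plain dict; `assemble` is a hypothetical helper, not a Spark function.

```python
DEFAULT_VOCAB_SIZE = 1 << 18  # 262144, CountVectorizer's default vocabSize


def assemble(sparse_size, sparse_values, scalar):
    """Append one scalar to a sparse vector given as (size, {index: value})."""
    values = dict(sparse_values)
    if scalar != 0:
        # The scalar lands in the first index after the sparse block.
        values[sparse_size] = float(scalar)
    return sparse_size + 1, values


# Illustrative TF-IDF entries plus the length of the first post (833).
size, values = assemble(DEFAULT_VOCAB_SIZE, {0: 1.2, 12: 0.7}, 833)
print(size)
print(values[DEFAULT_VOCAB_SIZE])
```

Note that the raw length is on a much larger scale than the TF-IDF weights, which may matter for models that are sensitive to feature scale.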

### The Model

We'll use Naive Bayes, but feel free to play around with this choice!

In [24]:
from pyspark.ml.classification import NaiveBayes

In [25]:
# Use defaults
nb = NaiveBayes()
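With default parameters, Spark's `NaiveBayes` is a multinomial model with smoothing 1.0. The sketch below is a textbook multinomial Naive Bayes with Laplace smoothing on made-up term counts, meant only to show the idea; it is not Spark's implementation and details such as prior smoothing may differ. The function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_multinomial_nb(samples, smoothing=1.0):
    """samples: list of (label, {term: count}) pairs."""
    class_docs = Counter()
    term_counts = defaultdict(Counter)
    vocab = set()
    for label, counts in samples:
        class_docs[label] += 1
        term_counts[label].update(counts)
        vocab.update(counts)
    total_docs = sum(class_docs.values())
    log_prior = {c: math.log(n / total_docs) for c, n in class_docs.items()}
    log_likelihood = {}
    for c in class_docs:
        # Laplace smoothing keeps unseen terms from getting zero probability.
        denom = sum(term_counts[c].values()) + smoothing * len(vocab)
        log_likelihood[c] = {
            t: math.log((term_counts[c][t] + smoothing) / denom) for t in vocab
        }
    return log_prior, log_likelihood, vocab


def predict(model, counts):
    log_prior, log_likelihood, vocab = model
    scores = {
        c: log_prior[c]
        + sum(n * log_likelihood[c][t] for t, n in counts.items() if t in vocab)
        for c in log_prior
    }
    return max(scores, key=scores.get)


train = [
    ("python", {"def": 2, "import": 1}),
    ("python", {"import": 2, "list": 1}),
    ("java", {"class": 2, "public": 1}),
    ("java", {"public": 2, "list": 1}),
]
model = train_multinomial_nb(train)
print(predict(model, {"import": 1, "list": 1}))
print(predict(model, {"public": 1}))
```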

### Pipeline

In [26]:
from pyspark.ml import Pipeline

In [27]:
data_prep_pipe = Pipeline(stages=[class_to_num,text_extractor,tokenizer,stopremove,count_vec,idf,clean_up])

In [28]:
cleaner = data_prep_pipe.fit(data)

In [29]:
clean_data = cleaner.transform(data)

### Training and Evaluation!

In [30]:
clean_data = clean_data.select(['label','features'])

In [31]:
clean_data.show() 

+-----+--------------------+
|label|            features|
+-----+--------------------+
|  5.0|(262145,[0,1,2,3,...|
|  3.0|(262145,[0,12,31,...|
| 15.0|(262145,[0,1,2,3,...|
|  0.0|(262145,[0,18,21,...|
| 17.0|(262145,[0,1,4,8,...|
|  3.0|(262145,[0,12,21,...|
|  2.0|(262145,[0,1,3,6,...|
| 10.0|(262145,[0,44,61,...|
| 18.0|(262145,[0,1,14,2...|
|  2.0|(262145,[0,1,3,4,...|
|  5.0|(262145,[0,2,3,6,...|
|  9.0|(262145,[0,18,27,...|
|  3.0|(262145,[0,7,12,1...|
|  4.0|(262145,[0,1,2,3,...|
|  3.0|(262145,[0,11,27,...|
|  0.0|(262145,[0,187,23...|
| 15.0|(262145,[0,10,15,...|
| 18.0|(262145,[0,1,3,12...|
| 10.0|(262145,[0,30,39,...|
| 19.0|(262145,[0,12,15,...|
+-----+--------------------+
only showing top 20 rows



In [32]:
(training,testing) = clean_data.randomSplit([0.7,0.3], seed=142)

In [33]:
#training.cache()

In [34]:
#testing.cache()

In [35]:
#training.groupBy("label").count().show()

In [36]:
#testing.groupBy("label").count().show()

In [37]:
predictor = nb.fit(training)

In [38]:
test_results = predictor.transform(testing)

In [39]:
test_results.show()

+-----+--------------------+--------------------+--------------------+----------+
|label|            features|       rawPrediction|         probability|prediction|
+-----+--------------------+--------------------+--------------------+----------+
|  0.0|(262145,[0,1,2,3,...|[-9382.3336032703...|[4.14232875819524...|       5.0|
|  0.0|(262145,[0,1,2,3,...|[-10653.148484507...|[9.37560071511545...|       3.0|
|  0.0|(262145,[0,1,2,3,...|[-4988.8319686942...|[2.66200887175650...|       5.0|
|  0.0|(262145,[0,1,2,3,...|[-8907.2383399015...|[1.01187987782906...|       5.0|
|  0.0|(262145,[0,1,2,3,...|[-3033.5517325957...|[0.99999999715052...|       0.0|
|  0.0|(262145,[0,1,2,3,...|[-3534.5320734550...|[4.77408947203727...|       3.0|
|  0.0|(262145,[0,1,7,8,...|[-5743.0674312095...|[1.0,0.0,0.0,4.05...|       0.0|
|  0.0|(262145,[0,1,8,9,...|[-2238.9850765317...|[1.50622218135388...|       3.0|
|  0.0|(262145,[0,1,8,11...|[-2495.0640264210...|[5.70259202776765...|       3.0|
|  0.0|(262145,[

In [40]:
# Create a confusion matrix
test_results.groupBy('label', 'prediction').count().show()

+-----+----------+-----+
|label|prediction|count|
+-----+----------+-----+
|  0.0|       8.0|    5|
|  1.0|      12.0|    1|
|  7.0|       3.0|    3|
| 19.0|      12.0|    2|
| 16.0|      10.0|    1|
|  8.0|       6.0|    1|
|  3.0|       9.0|    1|
|  5.0|       8.0|    6|
|  0.0|      10.0|    4|
| 19.0|      16.0|    3|
| 16.0|       3.0|    6|
|  5.0|      12.0|    8|
| 11.0|      19.0|    4|
|  6.0|       7.0|    2|
| 10.0|       0.0|   16|
| 19.0|      11.0|    2|
|  5.0|       2.0|    5|
| 12.0|      12.0|  382|
| 18.0|      12.0|    2|
| 13.0|       0.0|    3|
+-----+----------+-----+
only showing top 20 rows
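The numeric labels are hard to read without the tag names. By default `StringIndexer` orders labels by descending frequency and, in Spark 3.x, breaks ties alphabetically. Every tag here has 2000 rows, so the indices should simply follow alphabetical order. This agrees with the rows printed earlier (`c#` is 5.0, `asp.net` is 3.0, `objective-c` is 15.0). A short sketch that rebuilds the mapping in plain Python:

```python
tags = [
    "c", "python", "ruby-on-rails", "iphone", "c#", "jquery", "javascript",
    "php", "ios", "android", "asp.net", "html", "mysql", "css", "sql",
    "objective-c", ".net", "c++", "java", "angularjs",
]

# All counts are equal, so the alphabetical tie-break decides the index.
index_to_tag = dict(enumerate(sorted(tags)))
tag_to_index = {tag: float(i) for i, tag in index_to_tag.items()}

# A few (label, prediction, count) rows copied from the table above.
pairs = [(12.0, 12.0, 382), (10.0, 0.0, 16), (5.0, 12.0, 8)]
for label, prediction, count in pairs:
    print(index_to_tag[int(label)], "->", index_to_tag[int(prediction)], count)
```

In a live session the same list should be available from the fitted indexer, for example `cleaner.stages[0].labels`, which avoids relying on the ordering rule.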



In [41]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

In [42]:
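# Note: with no metricName the evaluator reports weighted F1, not accuracy.
# Pass metricName="accuracy" to measure accuracy instead.
acc_eval = MulticlassClassificationEvaluator()
acc = acc_eval.evaluate(test_results)
print("Weighted F1 of the model: {}".format(acc))

Weighted F1 of the model: 0.7179726079026453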
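A note on the metric: `MulticlassClassificationEvaluator()` created without a `metricName` uses its default, `"f1"`, which is the support-weighted F1 score, so the value printed by this cell is an F1 score rather than accuracy. The two often land close together but are not the same quantity. The sketch below computes both by hand on made-up (label, prediction) pairs; the helper names are hypothetical.

```python
from collections import Counter


def accuracy(pairs):
    return sum(1 for y, p in pairs if y == p) / len(pairs)


def weighted_f1(pairs):
    support = Counter(y for y, _ in pairs)
    total = 0.0
    for c, n in support.items():
        tp = sum(1 for y, p in pairs if y == c and p == c)
        fp = sum(1 for y, p in pairs if y != c and p == c)
        fn = n - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (
            2 * precision * recall / (precision + recall)
            if precision + recall
            else 0.0
        )
        # Weight each class's F1 by its share of the true labels.
        total += f1 * n / len(pairs)
    return total


pairs = [(0, 0), (0, 0), (0, 0), (0, 1), (1, 1), (1, 0), (1, 0), (2, 2)]
print(accuracy(pairs))
print(weighted_f1(pairs))
```

To report accuracy in Spark, construct the evaluator as `MulticlassClassificationEvaluator(metricName="accuracy")`.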


In [43]:
# Save the fitted model (not the unfitted estimator) to the local machine
# predictor.save("NB_TagFilters_model")

- The result is not very good (about 0.72).
- Possible fixes: try a different classification model, or engineer additional features.

### Use LogisticRegression/Random Forest

### Logistic Regression

In [44]:
from pyspark.ml.classification import RandomForestClassifier, LogisticRegression

In [45]:
lg = LogisticRegression(maxIter=100, regParam=0.3, elasticNetParam=0)

In [46]:
predictor_1 = lg.fit(training)

In [47]:
test_results_1 = predictor_1.transform(testing)

In [48]:
# Create a confusion matrix
test_results_1.groupBy('label', 'prediction').count().show()

+-----+----------+-----+
|label|prediction|count|
+-----+----------+-----+
|  0.0|       8.0|   27|
|  1.0|      12.0|    3|
| 19.0|      12.0|    1|
|  7.0|       3.0|    4|
| 16.0|      10.0|   18|
|  8.0|       6.0|    1|
| 17.0|       9.0|    1|
|  3.0|       9.0|    1|
| 14.0|       7.0|    1|
|  5.0|       8.0|   28|
|  0.0|      10.0|   20|
| 19.0|      16.0|    2|
| 16.0|       3.0|   12|
|  5.0|      12.0|   11|
| 11.0|      19.0|   10|
|  6.0|       7.0|    4|
| 10.0|       0.0|   11|
| 19.0|      11.0|    4|
|  5.0|       2.0|    4|
| 12.0|      12.0|  334|
+-----+----------+-----+
only showing top 20 rows



In [49]:
# Note: with no metricName the evaluator reports weighted F1, not accuracy.
acc_eval = MulticlassClassificationEvaluator()
acc_1 = acc_eval.evaluate(test_results_1)
print("Weighted F1 of the model: {}".format(acc_1))

Weighted F1 of the model: 0.7008097800813181


In [50]:
# The result is no better than Naive Bayes (about 0.70 versus 0.72).

In [51]:
# Save the fitted model (not the unfitted estimator) to the local machine
# predictor_1.save("LG_TagFilters_model")

### Random forest

In [52]:
rf = RandomForestClassifier(
    labelCol="label",
    featuresCol="features",
    numTrees=500,
    maxDepth=5,
    maxBins=64,
)

In [55]:
predictor_2 = rf.fit(training)

ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java server (127.0.0.1:57084)
Traceback (most recent call last):
  File "/Users/tranhoangbach/Documents/Spark/spark-3.1.1-bin-hadoop2.7/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 977, in _get_connection
    connection = self.deque.pop()
IndexError: pop from an empty deque

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/tranhoangbach/Documents/Spark/spark-3.1.1-bin-hadoop2.7/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1115, in start
    self.socket.connect((self.address, self.port))
ConnectionRefusedError: [Errno 61] Connection refused


Py4JNetworkError: An error occurred while trying to connect to the Java server (127.0.0.1:57084)

In [None]:
test_results_2 = predictor_2.transform(testing)

In [None]:
# Create a confusion matrix
test_results_2.groupBy('label', 'prediction').count().show()

In [None]:
test_results_2.groupBy('prediction').count().show()

In [None]:
# Note: with no metricName the evaluator reports weighted F1, not accuracy.
acc_eval = MulticlassClassificationEvaluator()
acc_2 = acc_eval.evaluate(test_results_2)
print("Weighted F1 of the model: {}".format(acc_2))

In [None]:
# It has higher accuracy, but that does not make it a better result.

In [None]:
# Save the fitted model (not the unfitted estimator) to the local machine
# predictor_2.save("RF_TagFilters_model")

In [None]:
# `sc` was never created in this notebook; stop the SparkSession instead.
spark.stop()