**Remarks**

In this notebook, we analyze tweet sentiment with PySpark. As in the previous lesson, we fetch the data from Kaggle. Let's get started! NOTE: If you are unsure how to download the dataset through the Kaggle API, you can download it manually here: [Sentiment Analysis](https://www.kaggle.com/kazanova/sentiment140)

In [3]:
# installing pyspark
!pip install pyspark

Collecting pyspark
  Downloading pyspark-3.1.2.tar.gz (212.4 MB)
Collecting py4j==0.10.9
  Downloading py4j-0.10.9-py2.py3-none-any.whl (198 kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.1.2-py2.py3-none-any.whl size=212880768 sha256=e55ed301d8e246d30e5a432073788d3ee5d50cb9a01b359ba237c44ec5fc60ff
  Stored in directory: /root/.cache/pip/wheels/a5/0a/c1/9561f6fecb759579a7d863dcd846daaa95f598744e71b02c77
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9 pyspark-3.1.2


In [8]:
# import library
# 1. pyspark environment
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.sql.types import *
from pyspark.ml.classification import NaiveBayes, RandomForestClassifier, LogisticRegression, DecisionTreeClassifier, GBTClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator, BinaryClassificationEvaluator
from pyspark.sql.functions import *
from pyspark.ml.feature import Tokenizer, StopWordsRemover, HashingTF, IDF

# 2. DataFrame environment
import pandas as pd
import numpy as np
import html

# 3. NLP tools
import spacy
import re
from nltk.stem import PorterStemmer

In [9]:
# load data from kaggle
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

In [10]:
!kaggle datasets download -d kazanova/sentiment140
!unzip sentiment140.zip

Downloading sentiment140.zip to /content
 90% 73.0M/80.9M [00:00<00:00, 69.3MB/s]
100% 80.9M/80.9M [00:00<00:00, 104MB/s] 
Archive:  sentiment140.zip
  inflating: training.1600000.processed.noemoticon.csv  


In [11]:
# creating spark session
spark = SparkSession.builder.appName('tweet').getOrCreate()

In [14]:
# reading dataset
data = spark.read.csv('/content/training.1600000.processed.noemoticon.csv', inferSchema=True)
data.printSchema()

root
 |-- _c0: integer (nullable = true)
 |-- _c1: long (nullable = true)
 |-- _c2: string (nullable = true)
 |-- _c3: string (nullable = true)
 |-- _c4: string (nullable = true)
 |-- _c5: string (nullable = true)



In [15]:
# renaming the default columns (_c0 ... _c5) to meaningful names
data = data.toDF("target", "id", "date", "flag", "user", "text")
data.limit(5).show()

+------+----------+--------------------+--------+---------------+--------------------+
|target|        id|                date|    flag|           user|                text|
+------+----------+--------------------+--------+---------------+--------------------+
|     0|1467810369|Mon Apr 06 22:19:...|NO_QUERY|_TheSpecialOne_|@switchfoot http:...|
|     0|1467810672|Mon Apr 06 22:19:...|NO_QUERY|  scotthamilton|is upset that he ...|
|     0|1467810917|Mon Apr 06 22:19:...|NO_QUERY|       mattycus|@Kenichan I dived...|
|     0|1467811184|Mon Apr 06 22:19:...|NO_QUERY|        ElleCTF|my whole body fee...|
|     0|1467811193|Mon Apr 06 22:19:...|NO_QUERY|         Karoli|@nationwideclass ...|
+------+----------+--------------------+--------+---------------+--------------------+



In [16]:
# Showing target information
data.groupBy("target").count().show()

+------+------+
|target| count|
+------+------+
|     4|800000|
|     0|800000|
+------+------+



In [17]:
# Counting null/NaN values in each column
data.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in data.columns]).show()

+------+---+----+----+----+----+
|target| id|date|flag|user|text|
+------+---+----+----+----+----+
|     0|  0|   0|   0|   0|   0|
+------+---+----+----+----+----+



The dataset is balanced and has no missing values, so it is ready for modelling. We only need to remap the target labels from (0, 4) to (0, 1).

In [18]:
# changing target label
data = data.withColumn("target", when(data["target"] == 4,1).otherwise(data["target"]))
data.groupBy("target").count().show()

+------+------+
|target| count|
+------+------+
|     1|800000|
|     0|800000|
+------+------+



In [22]:
# NLP RULE STEP
# 1. cleaning dataset
data_clean = data.select('target', (lower(regexp_replace('text', "[^a-zA-Z\\s]", "")).alias('text')))

# 2. tokenize text
tokenizer = Tokenizer(inputCol="text", outputCol="tokens")
data_token = tokenizer.transform(data_clean).select("target", "tokens")

# 3. remove stop-words
remover = StopWordsRemover(inputCol="tokens", outputCol="clean_word")
data_stop = remover.transform(data_token).select("target", "clean_word")

# 4. stem-text
stemmer = PorterStemmer()
stemmer_udf = udf(lambda tokens: [stemmer.stem(token) for token in tokens], ArrayType(StringType()))
data_stem = data_stop.withColumn("words_stemmed", stemmer_udf("clean_word")).select("target", "words_stemmed")

# 5. keep only tokens with at least 3 characters
filter_length_udf = udf(lambda row: [x for x in row if len(x) >= 3], ArrayType(StringType()))
final_words = data_stem.withColumn("words", filter_length_udf(col("words_stemmed")))

# 6. TF-IDF (computed on the length-filtered tokens from step 5)
hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures")
featurizedData = hashingTF.transform(final_words)
idf = IDF(inputCol="rawFeatures", outputCol="features")
idfModel = idf.fit(featurizedData)
rescaledData = idfModel.transform(featurizedData)
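To make the TF-IDF step above more concrete, here is a small plain-Python sketch of the weighting on a toy corpus. It uses the smoothed formula given in the Spark MLlib documentation, `idf = log((m + 1) / (df + 1))`, where `m` is the number of documents and `df` the number of documents containing the term. As a simplification (an assumption of this sketch), terms are kept in a dictionary keyed by the term itself, whereas `HashingTF` maps each term to a vector index by hashing. The function names and the toy documents are hypothetical.

```python
import math
from collections import Counter

def idf_weights(docs):
    """Smoothed inverse document frequency: log((m + 1) / (df + 1))."""
    m = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log((m + 1) / (count + 1)) for term, count in df.items()}

def tfidf(doc, idf):
    """Raw term frequency multiplied by the IDF weight of each term."""
    tf = Counter(doc)
    return {term: freq * idf[term] for term, freq in tf.items()}

# Toy, already-stemmed documents (made up for illustration).
docs = [["good", "day"], ["bad", "day"], ["good", "good", "movi"]]
idf = idf_weights(docs)
print(tfidf(docs[2], idf))
```

Terms that occur in every document get a weight of `log(1) = 0`, so they contribute nothing to the feature vector, while rarer terms are weighted up.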

In [None]:
# Modelling step
# 1. splitting data
train, test = rescaledData.randomSplit([0.7, 0.3])

# 2a. Naive Bayes
nb = NaiveBayes(modelType="multinomial", labelCol="target", featuresCol="features")
nbModel = nb.fit(train)
nb_predictions = nbModel.transform(test)
# 2b. Logistic Regression
lr = LogisticRegression(featuresCol='features', labelCol='target', maxIter=10)
lrModel = lr.fit(train)
lrPreds = lrModel.transform(test)
# 2c. Decision Tree
dt = DecisionTreeClassifier(featuresCol='features', labelCol='target', maxDepth=3)
dtModel = dt.fit(train)
dtPreds = dtModel.transform(test)
# 2d. Random Forest
rf = RandomForestClassifier(featuresCol='features', labelCol='target')
rfModel = rf.fit(train)
rfPreds = rfModel.transform(test)

In [None]:
# 3. Evaluation
# every model was trained with labelCol="target", so one evaluator can be reused
evaluator = MulticlassClassificationEvaluator(labelCol="target", predictionCol="prediction", metricName="accuracy")
nb_accuracy = evaluator.evaluate(nb_predictions)
print("Accuracy of NaiveBayes is = %g"% (nb_accuracy))
lr_accuracy = evaluator.evaluate(lrPreds)
print("Accuracy of Logistic Regression is = %g"% (lr_accuracy))
dt_accuracy = evaluator.evaluate(dtPreds)
print("Accuracy of Decision Trees is = %g"% (dt_accuracy))
rf_accuracy = evaluator.evaluate(rfPreds)
print("Accuracy of Random Forests is = %g"% (rf_accuracy))
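For reference, the `accuracy` metric used above is the fraction of test rows whose prediction equals the label. The sketch below computes the same quantity in plain Python on a handful of made-up `(label, prediction)` pairs. The helper name `accuracy` and the toy data are hypothetical and are not output from the models in this notebook.

```python
def accuracy(pairs):
    """Fraction of (label, prediction) pairs where the two values agree."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("accuracy is undefined for an empty set of predictions")
    correct = sum(1 for label, prediction in pairs if label == prediction)
    return correct / len(pairs)

# Toy pairs (made up): three of four predictions match their labels.
toy = [(0, 0.0), (1, 1.0), (1, 0.0), (0, 0.0)]
print("Accuracy = %g" % accuracy(toy))
```

Because the two classes are balanced here (800,000 tweets each), accuracy is a reasonable first metric. A classifier that always predicts a single class would score about 0.5.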