# Disaster Tweet Classification Using Spark NLP
In this project, I built a classification model that predicts which tweets are about real disasters and which are not. <br/>
I used a dataset that contains 10,000 tweets.

In [None]:
!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/colab_setup.sh -O - | bash

setup Colab for PySpark 3.0.2 and Spark NLP 3.1.0

In [None]:
#Importing Useful Packages
import pandas as pd
import numpy as np
from sklearn.metrics import classification_report, accuracy_score

import sparknlp
from sparknlp.annotator import *
from sparknlp.base import *
import pyspark.sql.functions as F
from pyspark.sql.functions import col
from pyspark.ml import Pipeline
from sparknlp.pretrained import PretrainedPipeline

In [None]:
#Starting Sparknlp
spark= sparknlp.start()

In [None]:
print("SparkNLP version: {}".format(sparknlp.version()))
print("Pyspark version: {}".format(spark.version))

SparkNLP version: 3.1.0
Pyspark version: 3.0.2


In [None]:
# Loading train and test datasets
df_train= spark.read\
    .option("header", True)\
    .csv("train.csv")
df_test= spark.read\
    .option("header", True)\
    .csv("test.csv")


In [None]:
df_train.show(5, truncate=False)  

+---+-------+--------+-------------------------------------------------------------------------------------------------------------------------------------+------+
|id |keyword|location|text                                                                                                                                 |target|
+---+-------+--------+-------------------------------------------------------------------------------------------------------------------------------------+------+
|1  |null   |null    |Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all                                                                |1     |
|4  |null   |null    |Forest fire near La Ronge Sask. Canada                                                                                               |1     |
|5  |null   |null    |All residents asked to 'shelter in place' are being notified by officers. No other evacuation or shelter in place orders are expected|1     |
|6  |null   |nul

**id:** A unique identifier for each tweet <br/>
**keyword:** A keyword from that tweet (although this may be blank) <br/>
**location:** The location the tweet was sent from (may also be blank) <br/>
**text:** The text of a tweet <br/>
**target:** In train.csv only, this denotes whether a tweet is about a real disaster (1) or not (0) <br/>

In [None]:
df_train.groupby("target")\
    .count()\
    .orderBy(col("count")).show()

+------+-----+
|target|count|
+------+-----+
|  null| 1211|
|     1| 3081|
|     0| 4095|
+------+-----+
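
The notebook does not say where the 1,211 null targets come from. One possible explanation, which is an assumption and not something confirmed here, is that some tweets contain line breaks inside quoted fields. A reader that treats every newline as a row boundary then produces extra, misaligned rows. The sketch below illustrates the effect with Python's standard `csv` module on a made-up two-tweet sample. If this were the cause, Spark's CSV reader has a `multiLine` option that could be tried when loading the file.

```python
import csv
import io

# Hypothetical sample, not taken from train.csv: the second tweet
# contains a line break inside its quoted text field.
raw = (
    "id,keyword,location,text,target\n"
    '1,,,"Forest fire near La Ronge",1\n'
    '2,,,"Line one\nline two",0\n'
)

# Naive parsing: every newline starts a new row, so the second tweet is split in two.
naive_rows = [line.split(",") for line in raw.strip().split("\n")[1:]]

# Quote-aware parsing: the embedded newline stays inside the text field.
parsed_rows = list(csv.reader(io.StringIO(raw)))[1:]

print(len(naive_rows), "rows when splitting on every newline")
print(len(parsed_rows), "rows with quote-aware parsing")
```

In the naive result, the fragment `line two",0` becomes a row of its own with too few columns, which is the kind of row that would show up with a null target.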



The data includes rows with a null target, which I will drop. First, I will drop the keyword and location columns.

In [None]:
drop_col= ["keyword", "location"]
df_train= df_train.drop(*drop_col)
df_train.show()

+---+--------------------+------+
| id|                text|target|
+---+--------------------+------+
|  1|Our Deeds are the...|     1|
|  4|Forest fire near ...|     1|
|  5|All residents ask...|     1|
|  6|13,000 people rec...|     1|
|  7|Just got sent thi...|     1|
|  8|#RockyFire Update...|     1|
| 10|#flood #disaster ...|     1|
| 13|I'm on top of the...|     1|
| 14|There's an emerge...|     1|
| 15|I'm afraid that t...|     1|
| 16|Three people died...|     1|
| 17|Haha South Tampa ...|     1|
| 18|#raining #floodin...|     1|
| 19|#Flood in Bago My...|     1|
| 20|Damage to school ...|     1|
| 23|      What's up man?|     0|
| 24|       I love fruits|     0|
| 25|    Summer is lovely|     0|
| 26|   My car is so fast|     0|
| 28|What a goooooooaa...|     0|
+---+--------------------+------+
only showing top 20 rows



In [None]:
# Drop rows with a null in any remaining column (this removes the rows with a null target)
df_train= df_train.na.drop(how="any")
df_train.groupby("target")\
    .count()\
    .orderBy(col("count")).show()

+------+-----+
|target|count|
+------+-----+
|     1| 3081|
|     0| 4095|
+------+-----+



Now there are no null values. <br/>
First I will apply the Spark NLP DocumentAssembler, which is the entry point to a Spark NLP pipeline. <br/>
After that, I will apply the Universal Sentence Encoder and then create a ClassifierDL stage. <br/>
Finally, I will put these stages into a pipeline and fit it on train_set.

In [None]:
document = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

use = UniversalSentenceEncoder.pretrained()\
 .setInputCols(["document"])\
 .setOutputCol("sentence_embeddings")

classifierdl = ClassifierDLApproach()\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("class")\
    .setLabelColumn("target")\
    .setMaxEpochs(10)\
    .setEnableOutputLogs(True)\
    .setLr(0.004)

nlpPipeline = Pipeline(
    stages = [
        document,
        use,
        classifierdl
    ])

tfhub_use download started this may take some time.
Approximate size to download 923.7 MB
[OK!]


In [None]:
#splitting the data into train set and test set
(train_set, test_set)= df_train.randomSplit([0.8, 0.2], seed=100)
print("Train set shape: {}".format((train_set.count(), len(train_set.columns))))
print("Test set shape: {}".format((test_set.count(), len(test_set.columns))))

Train set shape: (5744, 3)
Test set shape: (1432, 3)


In [None]:
#fitting with train_set
use_model = nlpPipeline.fit(train_set)

When we fit the pipeline, Spark NLP writes the training logs to the "annotator_logs" folder in our home directory. <br/>
Here is how to list the log files:

In [None]:
!cd ~/annotator_logs && ls -l

total 4
-rw-r--r-- 1 root root 788 Oct 13 09:33 ClassifierDLApproach_f70d46c11503.log


To check the training results of our model:

In [None]:
!cat ~/annotator_logs/ClassifierDLApproach_f70d46c11503.log

Training started - epochs: 10 - learning_rate: 0.004 - batch_size: 64 - training_examples: 5744 - classes: 2
Epoch 0/10 - 1.65s - loss: 42.825516 - acc: 0.81115407 - batches: 90
Epoch 1/10 - 1.37s - loss: 39.700607 - acc: 0.8449789 - batches: 90
Epoch 2/10 - 1.65s - loss: 37.751377 - acc: 0.85750234 - batches: 90
Epoch 3/10 - 1.36s - loss: 37.139404 - acc: 0.8644077 - batches: 90
Epoch 4/10 - 1.39s - loss: 36.602478 - acc: 0.8703768 - batches: 90
Epoch 5/10 - 1.37s - loss: 36.50698 - acc: 0.8719569 - batches: 90
Epoch 6/10 - 1.34s - loss: 36.32841 - acc: 0.87845266 - batches: 90
Epoch 7/10 - 1.45s - loss: 36.23835 - acc: 0.88126165 - batches: 90
Epoch 8/10 - 1.41s - loss: 36.23583 - acc: 0.88448036 - batches: 90
Epoch 9/10 - 1.35s - loss: 36.2266 - acc: 0.8860604 - batches: 90
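
The per-epoch accuracy can also be pulled out of the log programmatically. The following is a minimal sketch that assumes the log keeps the line format shown above. It uses two lines copied from that output as an in-memory sample instead of reading the file.

```python
import re

# Two lines copied from the log shown above; in practice the text
# would be read from the file in ~/annotator_logs.
log_text = """Epoch 8/10 - 1.41s - loss: 36.23583 - acc: 0.88448036 - batches: 90
Epoch 9/10 - 1.35s - loss: 36.2266 - acc: 0.8860604 - batches: 90"""

# Capture the epoch index and the accuracy from each "Epoch N/M ... acc: X" line.
pattern = re.compile(r"Epoch (\d+)/\d+ .* acc: ([0-9.]+)")
history = [(int(m.group(1)), float(m.group(2))) for m in pattern.finditer(log_text)]

final_epoch, final_acc = history[-1]
print(f"epoch {final_epoch}: training accuracy {final_acc:.3f}")
# prints "epoch 9: training accuracy 0.886"
```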


The model reached about 89% accuracy on train_set (0.886 in the final epoch). <br/>
Let's check the model on test_set using sklearn metrics.

In [None]:
preds= use_model.transform(test_set)
preds.select("target", "text", "class.result").show(5, truncate=False)

+------+----------------------------------------------------------------------------------------------------------------------------------------+------+
|target|text                                                                                                                                    |result|
+------+----------------------------------------------------------------------------------------------------------------------------------------+------+
|0     |I liked a @YouTube video http://t.co/0h7OUa1pns Call of Duty: Ghosts - Campanha - EP 6 'Tsunami'                                        |[0]   |
|1     |Gail and Russell saw lots of hail at their Dalroy home - they have video of twister 1/2 mile from their home #yyc http://t.co/3VfKEdGrsO|[1]   |
|0     |love 106.1 The Twister @1061thetwister  and Maddie and Tae #OKTXDUO                                                                     |[0]   |
|0     |Brain twister let drop up telly structuring cast: EDcXO                   

In [None]:
df= use_model.transform(test_set).select("target", "document", "class.result").toPandas()
df["result"]= df["result"].apply(lambda x: x[0])
print(classification_report(df["target"], df["result"]))
print(accuracy_score(df["target"], df["result"]))

              precision    recall  f1-score   support

           0       0.80      0.89      0.84       813
           1       0.83      0.70      0.76       619

    accuracy                           0.81      1432
   macro avg       0.81      0.80      0.80      1432
weighted avg       0.81      0.81      0.81      1432

0.8079608938547486
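
For reference, the accuracy above corresponds to 1,157 correct predictions out of 1,432. The sketch below shows how accuracy, precision, recall, and F1 are defined for one class, using plain Python on made-up labels. The labels are hypothetical and `binary_metrics` is an illustrative helper, not part of sklearn. Labels are strings here because the CSV was read without schema inference.

```python
# Hypothetical labels for illustration only; they are not taken from test_set.
y_true = ["1", "1", "1", "0", "0", "0", "0", "1"]
y_pred = ["1", "0", "1", "0", "0", "1", "0", "1"]

def binary_metrics(y_true, y_pred, positive="1"):
    """Compute accuracy plus precision, recall, and F1 for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    correct = sum(t == p for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": correct / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

In the report above, the lower recall for class 1 (0.70) means the model misses disaster tweets more often than it misses non-disaster tweets.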


We achieved 81% accuracy on test_set. <br/>
#### Applying the model to the test.csv data for submission.

In [None]:
df_test= spark.read\
    .option("header", True)\
    .csv("test.csv")

In [None]:
df_test.show(5, truncate=False)  #this is the test data

+---+-------+--------+------------------------------------------------------------------------------------------------+
|id |keyword|location|text                                                                                            |
+---+-------+--------+------------------------------------------------------------------------------------------------+
|0  |null   |null    |Just happened a terrible car crash                                                              |
|2  |null   |null    |Heard about #earthquake is different cities, stay safe everyone.                                |
|3  |null   |null    |there is a forest fire at spot pond, geese are fleeing across the street, I cannot save them all|
|9  |null   |null    |Apocalypse lighting. #Spokane #wildfires                                                        |
|11 |null   |null    |Typhoon Soudelor kills 28 in China and Taiwan                                                   |
+---+-------+--------+------------------

In [None]:
submission.show(5, truncate=False) #this is the submission format

+---+------+
|id |target|
+---+------+
|0  |0     |
|2  |0     |
|3  |0     |
|9  |0     |
|11 |0     |
+---+------+
only showing top 5 rows
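
Before uploading, it can help to check a file against this format. The following is a minimal sketch with a hypothetical helper, `check_submission`. It assumes the expected format is exactly the two columns shown above with `target` equal to 0 or 1, and it works on in-memory CSV text so it runs without any files.

```python
import csv
import io

def check_submission(csv_text):
    """Return a list of problems found in submission CSV text (hypothetical helper)."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    problems = []
    if not rows or list(rows[0].keys()) != ["id", "target"]:
        problems.append("expected a header 'id,target' and at least one row")
        return problems
    for row in rows:
        # Each target must be the plain label 0 or 1, not a list such as ['1'].
        if row["target"] not in ("0", "1"):
            problems.append(f"id {row['id']}: unexpected target {row['target']!r}")
    return problems

good = "id,target\n0,1\n2,0\n"
bad = "id,target\n0,['1']\n"

print(check_submission(good))
print(check_submission(bad))
```

The second sample mimics what is written when the prediction column still holds one-element lists, which the check reports as an unexpected target.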



In [None]:
preds= use_model.transform(df_test)
preds.select("id","text", "class.result").show(5, truncate=False)

+---+------------------------------------------------------------------------------------------------+------+
|id |text                                                                                            |result|
+---+------------------------------------------------------------------------------------------------+------+
|0  |Just happened a terrible car crash                                                              |[1]   |
|2  |Heard about #earthquake is different cities, stay safe everyone.                                |[1]   |
|3  |there is a forest fire at spot pond, geese are fleeing across the street, I cannot save them all|[1]   |
|9  |Apocalypse lighting. #Spokane #wildfires                                                        |[1]   |
|11 |Typhoon Soudelor kills 28 in China and Taiwan                                                   |[1]   |
+---+------------------------------------------------------------------------------------------------+------+
only showi

In [None]:
final = use_model.transform(df_test).select("id", "class.result")

In [None]:
# This is the final step
final= final.withColumnRenamed("result" ,"target")
final.show(5)

+---+------+
| id|target|
+---+------+
|  0|   [1]|
|  2|   [1]|
|  3|   [1]|
|  9|   [1]|
| 11|   [1]|
+---+------+
only showing top 5 rows



In [None]:
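final= final.toPandas()
# Unwrap the one-element prediction lists (e.g. ['1'] -> '1') to match the submission format
final["target"]= final["target"].apply(lambda x: x[0] if len(x) else None)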

In [None]:
final.to_csv(index=False, path_or_buf="/content/final_submission.csv")

id        3613
target    3613
dtype: int64