# Disaster Tweets Classification By Using SparkNLP
In this project, the challenge is to build a classification model that predicts which tweets are about real disasters and which ones aren't. <br/>
I have access to a dataset of 10,000 tweets that were hand-classified.

In [1]:
#Importing Useful Packages
import pandas as pd
import numpy as np
from sklearn.metrics import classification_report, accuracy_score

import sparknlp
from sparknlp.annotator import *
from sparknlp.base import *
import pyspark.sql.functions as F
from pyspark.sql.functions import col
from pyspark.ml import Pipeline
from sparknlp.pretrained import PretrainedPipeline

In [2]:
#Starting Sparknlp
spark= sparknlp.start()

In [3]:
print("SparkNLP version: {}".format(sparknlp.version()))
print("Pyspark version: {}".format(spark.version))

SparkNLP version: 2.6.1
Pyspark version: 2.4.4


In [4]:
# Loading train and test datasets
df_train= spark.read\
    .option("header", True)\
    .csv("train.csv")
df_test= spark.read\
    .option("header", True)\
    .csv("test.csv")
submission= spark.read\
    .option("header", True)\
    .csv("sample_submission.csv")

In [5]:
df_train.show(5, truncate=False)  

+---+-------+--------+-------------------------------------------------------------------------------------------------------------------------------------+------+
|id |keyword|location|text                                                                                                                                 |target|
+---+-------+--------+-------------------------------------------------------------------------------------------------------------------------------------+------+
|1  |null   |null    |Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all                                                                |1     |
|4  |null   |null    |Forest fire near La Ronge Sask. Canada                                                                                               |1     |
|5  |null   |null    |All residents asked to 'shelter in place' are being notified by officers. No other evacuation or shelter in place orders are expected|1     |
|6  |null   |nul

**id:** A unique identifier for each tweet <br/>
**keyword:** A keyword from that tweet (although this may be blank) <br/>
**location:** The location the tweet was sent from (may also be blank) <br/>
**text:** The text of a tweet <br/>
**target:** In train.csv only, this denotes whether a tweet is about a real disaster (1) or not (0) <br/>

In [6]:
df_train.groupby("target")\
    .count()\
    .orderBy(col("count")).show()

+------+-----+
|target|count|
+------+-----+
|  null| 1211|
|     1| 3081|
|     0| 4095|
+------+-----+
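One possible cause of the null targets in the count above (an assumption, not something verified in this notebook) is that some tweets contain line breaks inside quoted fields. A reader that treats every newline as the end of a record then breaks such a tweet into several partial rows whose target is missing. The sketch below uses only Python's standard csv module and made-up sample data to show the effect. In Spark, the CSV reader's `multiLine` option addresses the same kind of problem.

```python
import csv
import io

# Made-up sample: the first tweet contains a newline inside a quoted field.
raw = 'id,text,target\n1,"Forest fire\nnear La Ronge",1\n2,Nice day,0\n'

# Naive parsing: split on every newline, then on commas.
naive_rows = [line.split(",") for line in raw.splitlines()[1:]]

# Quote-aware parsing: the embedded newline stays inside the text field.
proper_rows = list(csv.DictReader(io.StringIO(raw)))

print(len(naive_rows))   # 3 rows, two of them incomplete
print(len(proper_rows))  # 2 rows, each with id, text and target
```

If this is what happens with train.csv, dropping the null rows discards real tweets, so it may be worth re-reading the file with multi-line support before cleaning.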



The data include some null values. I will drop the keyword and location columns first, since they may be blank, and then drop the remaining rows that contain nulls.

In [43]:
drop_col= ["keyword", "location"]
df_train= df_train.drop(*drop_col)
df_train.show()

+---+--------------------+------+
| id|                text|target|
+---+--------------------+------+
| 48|@bbcmtd Wholesale...|     1|
| 49|We always try to ...|     0|
| 50|#AFRICANBAZE: Bre...|     1|
| 52|Crying out for mo...|     0|
| 53|On plus side LOOK...|     0|
| 54|@PhDSquares #mufc...|     0|
| 55|INEC Office in Ab...|     1|
| 57|Ablaze for you Lo...|     0|
| 59|Check these out: ...|     0|
| 62|Had an awesome ti...|     0|
| 66|How the West was ...|     1|
| 68|Check these out: ...|     0|
| 71|First night with ...|     0|
| 73|Deputies: Man sho...|     1|
| 76|SANTA CRUZ ÛÓ He...|     0|
| 77|Police: Arsonist ...|     1|
| 78|Noches El-Bestia ...|     0|
| 79|#Kurds trampling ...|     1|
| 80|TRUCK ABLAZE : R2...|     1|
| 81|Set our hearts ab...|     0|
+---+--------------------+------+
only showing top 20 rows



In [44]:
# Drop rows that contain a null in any remaining column (id, text or target).
df_train= df_train.na.drop(how="any")
df_train.groupby("target")\
    .count()\
    .orderBy(col("count")).show()

+------+-----+
|target|count|
+------+-----+
|     1| 2062|
|     0| 2709|
+------+-----+



Now there are no null values left. <br/>
First I will apply Spark NLP's DocumentAssembler, which is the entry point to a Spark NLP pipeline. <br/>
After that, I will apply the Universal Sentence Encoder and then create a ClassifierDL annotator. <br/>
Finally, I will put these stages into a pipeline and fit it on train_set.

In [28]:
document = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

use = UniversalSentenceEncoder.pretrained()\
    .setInputCols(["document"])\
    .setOutputCol("sentence_embeddings")

# Trainable classifier on top of the sentence embeddings
classifierdl = ClassifierDLApproach()\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("class")\
    .setLabelColumn("target")\
    .setMaxEpochs(10)\
    .setEnableOutputLogs(True)\
    .setLr(0.004)

nlpPipeline = Pipeline(
    stages = [
        document,
        use,
        classifierdl
    ])

tfhub_use download started this may take some time.
Approximate size to download 923,7 MB
[OK!]


In [45]:
#splitting the data into train set and test set
(train_set, test_set)= df_train.randomSplit([0.8, 0.2], seed=100)
print("Train set shape: {}".format((train_set.count(), len(train_set.columns))))
print("Test set shape: {}".format((test_set.count(), len(test_set.columns))))

Train set shape: (3834, 3)
Test set shape: (937, 3)
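Note that randomSplit does not cut the data at an exact 80/20 boundary: here 3834 of 4771 rows (about 80.4%) ended up in the training set. The following plain-Python sketch illustrates the general idea of a weighted, seeded random split. It is a conceptual stand-in, not Spark's actual implementation, and the function name `random_split` is hypothetical.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row to one split at random, in proportion to weights."""
    rng = random.Random(seed)
    total = sum(weights)
    # Cumulative upper bounds, e.g. [0.8, 1.0] for weights [0.8, 0.2].
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)
    splits = [[] for _ in weights]
    for row in rows:
        r = rng.random()
        # Pick the first split whose bound exceeds r; fall back to the last one.
        index = next((i for i, b in enumerate(bounds) if r < b), len(bounds) - 1)
        splits[index].append(row)
    return splits

rows = list(range(1000))
train, test = random_split(rows, [0.8, 0.2], seed=100)
print(len(train), len(test))
```

Because each row is assigned independently, the split sizes only approximate the weights, and fixing the seed makes the assignment repeatable.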


In [32]:
#fitting with train_set
use_model = nlpPipeline.fit(train_set)

When we fit the pipeline, Spark NLP writes the training logs to the "annotator_logs" folder in our home directory. <br/>
Here is how you can list the log files:

In [33]:
!cd ~/annotator_logs && ls -l

total 48
-rw-r--r--  1 ahmetemintek  staff   447 26 Eyl 12:55 ClassifierDLApproach_70b0c4c2a95a.log
-rw-r--r--  1 ahmetemintek  staff   445 27 Eyl 12:31 ClassifierDLApproach_94dd5e89b5b4.log
-rw-r--r--  1 ahmetemintek  staff   797 27 Eyl 12:45 ClassifierDLApproach_a15f773cd2bd.log
-rw-r--r--  1 ahmetemintek  staff   889 27 Eyl 12:42 ClassifierDLApproach_d968d7735acf.log
-rw-r--r--  1 ahmetemintek  staff   445 27 Eyl 01:46 ClassifierDLApproach_f3c2764d4358.log
-rw-r--r--  1 ahmetemintek  staff  1579 27 Eyl 12:47 ClassifierDLApproach_f8faa19c5d43.log


To check the training results of our model:

In [35]:
!cat ~/annotator_logs/ClassifierDLApproach_f8faa19c5d43.log

Training started - epochs: 10 - learning_rate: 0.004 - batch_size: 64 - training_examples: 4771 - classes: 2
Epoch 0/10 - 1,70s - loss: 31.218817 - acc: 0.7941361 - batches: 75
Epoch 1/10 - 1,23s - loss: 30.818155 - acc: 0.8378439 - batches: 75
Epoch 2/10 - 1,19s - loss: 31.518713 - acc: 0.8504766 - batches: 75
Epoch 3/10 - 1,16s - loss: 31.563896 - acc: 0.8565999 - batches: 75
Epoch 4/10 - 1,17s - loss: 31.516844 - acc: 0.86272323 - batches: 75
Epoch 5/10 - 1,17s - loss: 31.103827 - acc: 0.8684242 - batches: 75
Epoch 6/10 - 1,19s - loss: 30.753832 - acc: 0.87264717 - batches: 75
Epoch 7/10 - 1,17s - loss: 30.671005 - acc: 0.8762005 - batches: 75
Epoch 8/10 - 1,18s - loss: 30.478905 - acc: 0.8785231 - batches: 75
Epoch 9/10 - 1,18s - loss: 30.123034 - acc: 0.8795789 - batches: 75
Training started - epochs: 10 - learning_rate: 0.004 - batch_size: 64 - training_examples: 3834 - classes: 2
Epoch 0/10 - 1,49s - loss: 30.006615 - acc: 0.7987745 - batches: 60
Epoch 1/10 - 0,94s - loss: 26.76
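Because a single log file can contain more than one training run (the log above has two "Training started" lines), it can help to parse the log instead of reading it by eye. The sketch below assumes the line format shown above. The helper name `parse_runs` is hypothetical, and the sample text is copied from the first lines of that log.

```python
import re

# Matches lines such as:
# Epoch 0/10 - 1,70s - loss: 31.218817 - acc: 0.7941361 - batches: 75
EPOCH_RE = re.compile(r"Epoch (\d+)/(\d+) .* loss: ([\d.]+) - acc: ([\d.]+)")

def parse_runs(log_text):
    """Return one list of (epoch, loss, acc) tuples per training run."""
    runs = []
    for line in log_text.splitlines():
        if line.startswith("Training started"):
            runs.append([])  # a new run begins
            continue
        match = EPOCH_RE.match(line)
        if match and runs:
            epoch, _, loss, acc = match.groups()
            runs[-1].append((int(epoch), float(loss), float(acc)))
    return runs

sample = """Training started - epochs: 10 - learning_rate: 0.004 - batch_size: 64 - training_examples: 4771 - classes: 2
Epoch 0/10 - 1,70s - loss: 31.218817 - acc: 0.7941361 - batches: 75
Epoch 1/10 - 1,23s - loss: 30.818155 - acc: 0.8378439 - batches: 75
Training started - epochs: 10 - learning_rate: 0.004 - batch_size: 64 - training_examples: 3834 - classes: 2
Epoch 0/10 - 1,49s - loss: 30.006615 - acc: 0.7987745 - batches: 60
"""

runs = parse_runs(sample)
for i, run in enumerate(runs):
    last_epoch, last_loss, last_acc = run[-1]
    print(f"run {i}: last epoch {last_epoch}, acc {last_acc:.4f}")
```

Keeping the runs separate makes it clear which accuracy belongs to the fit on train_set (3834 training examples) and which to an earlier fit on the full data.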

The model reached about 87% accuracy on train_set. <br/>
Let's check the model on test_set using sklearn metrics.

In [36]:
preds= use_model.transform(test_set)
preds.select("target", "text", "class.result").show(5, truncate=False)

+------+----------------------------------------------------------------------------------------------------------------------------------------+------+
|target|text                                                                                                                                    |result|
+------+----------------------------------------------------------------------------------------------------------------------------------------+------+
|1     |Gail and Russell saw lots of hail at their Dalroy home - they have video of twister 1/2 mile from their home #yyc http://t.co/3VfKEdGrsO|[1]   |
|0     |Crazy Mom Threw Teen Daughter a NUDE Twister Sex Party According To Her Friend59 more pics http://t.co/t94LNfwf34 http://t.co/roCyyEI2dM|[0]   |
|0     |The Sharper Image Viper 24' Hardside Twister (Black) http://t.co/FXk3zsj2PE                                                             |[0]   |
|0     |Why Some Traffic Is Freezing Cold And Some Blazing Hot ÛÒ And How To Heat

In [37]:
df= use_model.transform(test_set).select("target", "document", "class.result").toPandas()
df["result"]= df["result"].apply(lambda x: x[0])
print(classification_report(df["target"], df["result"]))
print(accuracy_score(df["target"], df["result"]))

              precision    recall  f1-score   support

           0       0.81      0.88      0.84       539
           1       0.82      0.72      0.77       398

    accuracy                           0.81       937
   macro avg       0.81      0.80      0.81       937
weighted avg       0.81      0.81      0.81       937

0.8132337246531484
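To make the report above easier to read, here is a minimal plain-Python sketch of how precision, recall and accuracy are derived from true and predicted labels. The labels are made up for illustration and are not taken from test_set, and the function name `precision_recall` is hypothetical. The labels are strings because the CSV was read without schema inference.

```python
def precision_recall(y_true, y_pred, positive):
    """Precision and recall for one class, treated as the positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Made-up labels: "1" means disaster, "0" means not a disaster.
y_true = ["1", "1", "1", "0", "0", "0", "0", "1"]
y_pred = ["1", "0", "0", "0", "0", "1", "0", "1"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall = precision_recall(y_true, y_pred, positive="1")
print(accuracy, precision, recall)
```

In the report above, class 1 has a lower recall (0.72) than precision (0.82), which means the model misses real disaster tweets more often than it raises false alarms.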


The model reaches about 81% accuracy on test_set. <br/>
####  Applying the model on the test.csv data for submission.

In [96]:
df_test= spark.read\
    .option("header", True)\
    .csv("test.csv")

In [117]:
df_test.show(5, truncate=False)  #this is the test data

+---+-------+--------+------------------------------------------------------------------------------------------------+
|id |keyword|location|text                                                                                            |
+---+-------+--------+------------------------------------------------------------------------------------------------+
|0  |null   |null    |Just happened a terrible car crash                                                              |
|2  |null   |null    |Heard about #earthquake is different cities, stay safe everyone.                                |
|3  |null   |null    |there is a forest fire at spot pond, geese are fleeing across the street, I cannot save them all|
|9  |null   |null    |Apocalypse lighting. #Spokane #wildfires                                                        |
|11 |null   |null    |Typhoon Soudelor kills 28 in China and Taiwan                                                   |
+---+-------+--------+------------------

In [41]:
submission.show(5, truncate=False) #this is the submission format

+---+------+
|id |target|
+---+------+
|0  |0     |
|2  |0     |
|3  |0     |
|9  |0     |
|11 |0     |
+---+------+
only showing top 5 rows



In [118]:
preds= use_model.transform(df_test)
preds.select("id","text", "class.result").show(5, truncate=False)

+---+------------------------------------------------------------------------------------------------+------+
|id |text                                                                                            |result|
+---+------------------------------------------------------------------------------------------------+------+
|0  |Just happened a terrible car crash                                                              |[1]   |
|2  |Heard about #earthquake is different cities, stay safe everyone.                                |[1]   |
|3  |there is a forest fire at spot pond, geese are fleeing across the street, I cannot save them all|[1]   |
|9  |Apocalypse lighting. #Spokane #wildfires                                                        |[1]   |
|11 |Typhoon Soudelor kills 28 in China and Taiwan                                                   |[1]   |
+---+------------------------------------------------------------------------------------------------+------+
only showi

In [130]:
final = use_model.transform(df_test).select("id", "class.result")

In [137]:
# This is the final step
final= final.withColumnRenamed("result" ,"target")
final.show(5)

+---+------+
| id|target|
+---+------+
|  0|   [1]|
|  2|   [1]|
|  3|   [1]|
|  9|   [1]|
| 11|   [1]|
+---+------+
only showing top 5 rows
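Note that the target column above still holds one-element arrays such as [1], while the sample submission expects a plain 0 or 1. As a hedged sketch of the remaining cleanup, the plain-Python example below unwraps each prediction and writes an id,target CSV. The rows and ids are made up, the helper names are hypothetical, and the file name submission.csv is an assumption. In Spark, selecting `F.col("class.result").getItem(0)` instead of the whole array would be one way to obtain the same scalar column.

```python
import csv
import io

# Made-up rows shaped like the `final` DataFrame: target is a one-element list.
rows = [
    {"id": "100", "target": ["1"]},
    {"id": "101", "target": ["0"]},
    {"id": "102", "target": ["1"]},
]

def to_submission_rows(rows):
    """Unwrap each one-element prediction list into a scalar label."""
    return [{"id": r["id"], "target": r["target"][0]} for r in rows]

def write_submission(rows, handle):
    """Write rows in the id,target layout of the sample submission."""
    writer = csv.DictWriter(handle, fieldnames=["id", "target"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)

buffer = io.StringIO()
write_submission(to_submission_rows(rows), buffer)
print(buffer.getvalue())

# To write a real file instead (the file name is an assumption):
# with open("submission.csv", "w", newline="") as f:
#     write_submission(to_submission_rows(rows), f)
```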

