# Download Library

In [84]:
# This is only to setup PySpark and Spark NLP on Colab
!wget http://setup.johnsnowlabs.com/colab.sh -O - | bash

--2023-08-23 18:47:50--  http://setup.johnsnowlabs.com/colab.sh
Resolving setup.johnsnowlabs.com (setup.johnsnowlabs.com)... 51.158.130.125
Connecting to setup.johnsnowlabs.com (setup.johnsnowlabs.com)|51.158.130.125|:80... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/scripts/colab_setup.sh [following]
--2023-08-23 18:47:50--  https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/scripts/colab_setup.sh
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1191 (1.2K) [text/plain]
Saving to: ‘STDOUT’

-                     0%[                    ]       0  --.-KB/s
Installing PySpark 3.2.3 and Spark NLP 5.0.2
setup Colab for PySpark 3.2.3 an

# Import Library

In [85]:
#Importing Useful Packages
import pandas as pd
import numpy as np
from sklearn.metrics import classification_report, accuracy_score

import sparknlp
from sparknlp.annotator import *
from sparknlp.base import *
import pyspark.sql.functions as F
from pyspark.sql.functions import col
from pyspark.ml import Pipeline
from sparknlp.pretrained import PretrainedPipeline

# Use Spark NLP to read dataset



In [86]:
#Starting Sparknlp
spark= sparknlp.start()



In [87]:
print("SparkNLP version: {}".format(sparknlp.version()))
print("Pyspark version: {}".format(spark.version))

SparkNLP version: 5.0.2
Pyspark version: 3.2.3


## About Dataset
Context:
The file contains over 11,000 tweets associated with disaster keywords like “crash”, “quarantine”, and “bush fires”, along with each tweet's location and the keyword itself. The data structure was inherited from Disasters on social media.

The tweets were collected on Jan 14th, 2020.

Some of the topics people were tweeting:
1. The eruption of Taal Volcano in Batangas, Philippines
2. Coronavirus
3. Bushfires in Australia
4. Iran's downing of flight PS752

Disclaimer: The dataset contains text that may be considered profane, vulgar, or offensive.

Dataset Link:
https://www.kaggle.com/datasets/vstepanenko/disaster-tweets

In [88]:
# Load the combined dataset
df_combined = spark.read\
    .option("header", True)\
    .option("quote", "\"")\
    .option("escape", "\"")\
    .csv("tweets.csv")
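
The `quote` and `escape` options matter because tweets often contain commas and double quotes. As a minimal sketch, Python's standard `csv` module (not Spark) follows the same convention, where a literal quote inside a quoted field is written as two quotes. The sample row below is made up for illustration and is not taken from `tweets.csv`.

```python
import csv
import io

# Made-up sample row (an assumption, not real data): the text field
# contains both a comma and embedded double quotes, written as "".
raw = (
    "id,keyword,location,text,target\n"
    '0,ablaze,,"He said ""stay safe"", then left",0\n'
)

# csv.DictReader treats a doubled quote inside a quoted field as one literal quote.
rows = list(csv.DictReader(io.StringIO(raw)))
text = rows[0]["text"]
target = rows[0]["target"]

print(text)
print(repr(target))  # every value is read as a string
```

Without this handling, the comma inside the tweet would be read as a column separator and the `target` value would land in the wrong column.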



In [89]:
df_combined.show(10, truncate=False)

+---+-------+---------------+--------------------------------------------------------------------------------------------------------------------------------------------+------+
|id |keyword|location       |text                                                                                                                                        |target|
+---+-------+---------------+--------------------------------------------------------------------------------------------------------------------------------------------+------+
|0  |ablaze |null           |Communal violence in Bhainsa, Telangana. "Stones were pelted on Muslims' houses and some houses and vehicles were set ablaze…               |1     |
|1  |ablaze |null           |Telangana: Section 144 has been imposed in Bhainsa from January 13 to 15, after clash erupted between two groups on January 12. Po…         |1     |
|2  |ablaze |New York City  |Arsonist sets cars ablaze at dealership https://t.co/gOQvyJbpVI                  

Drop the "keyword" and "location" columns

In [90]:
drop_col= ["keyword", "location"]
df_combined= df_combined.drop(*drop_col)
df_combined.show()

+---+--------------------+------+
| id|                text|target|
+---+--------------------+------+
|  0|Communal violence...|     1|
|  1|Telangana: Sectio...|     1|
|  2|Arsonist sets car...|     1|
|  3|Arsonist sets car...|     1|
|  4|"Lord Jesus, your...|     0|
|  5|If this child was...|     0|
|  6|Several houses ha...|     1|
|  7|Asansol: A BJP of...|     1|
|  8|National Security...|     0|
|  9|This creature who...|     0|
| 10|Images showing th...|     1|
| 11|Social media went...|     0|
| 12|Hausa youths set ...|     1|
| 13|Under #MamataBane...|     1|
| 14|AMEN! Set the who...|     0|
| 15|Images showing th...|     1|
| 16|No cows today but...|     1|
| 17|Rengoku sets my h...|     0|
| 18|paulzizkaphoto: “...|     0|
| 19|French cameroun s...|     1|
+---+--------------------+------+
only showing top 20 rows



The `target` column contains some null values, so I will drop those rows.

In [91]:
df_combined.groupby("target")\
    .count()\
    .orderBy(col("count")).show()

+------+-----+
|target|count|
+------+-----+
|  null|   13|
|     1| 2113|
|     0| 9251|
+------+-----+



In [92]:
# Dropping rows with null values in the "target" column
df_combined = df_combined.na.drop(subset=["target"])

# Show the DataFrame after dropping null values
df_combined.groupby("target")\
    .count()\
    .orderBy(col("count")).show()


+------+-----+
|target|count|
+------+-----+
|     1| 2113|
|     0| 9251|
+------+-----+



# Split Data
We use the `sample` function with `withReplacement=False`, `fraction=0.8`, and `seed=123` to create the `train_data` DataFrame, which holds approximately 80% of the rows for training. We then use the `subtract` function to create the `test_data` DataFrame from the remaining rows. Because `subtract` behaves like SQL `EXCEPT DISTINCT`, it also removes duplicate rows; this is harmless when every row is unique, as the `id` column suggests.

In [93]:
train_data = df_combined.sample(False, 0.8, seed=123)  # 80% for training
test_data = df_combined.subtract(train_data)  # Remaining data for testing


In [94]:
train_data.show(5, truncate=False)
test_data.show(5, truncate=False)

+---+--------------------------------------------------------------------------------------------------------------------------------------------+------+
|id |text                                                                                                                                        |target|
+---+--------------------------------------------------------------------------------------------------------------------------------------------+------+
|0  |Communal violence in Bhainsa, Telangana. "Stones were pelted on Muslims' houses and some houses and vehicles were set ablaze…               |1     |
|1  |Telangana: Section 144 has been imposed in Bhainsa from January 13 to 15, after clash erupted between two groups on January 12. Po…         |1     |
|3  |Arsonist sets cars ablaze at dealership https://t.co/0gL7NUCPlb https://t.co/u1CcBhOWh9                                                     |1     |
|4  |"Lord Jesus, your love brings freedom and pardon. Fill me with your Hol

In [95]:
print("Train data shape: {}".format((train_data.count(), len(train_data.columns))))
print("Test data shape: {}".format((test_data.count(), len(test_data.columns))))

Train data shape: (9053, 3)
Test data shape: (2311, 3)
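
The train set has 9053 of 11364 rows (about 79.7%) rather than exactly 80%, because sampling without replacement keeps each row independently with probability `fraction`. The sketch below imitates that behaviour in plain Python. It is an illustration under that assumption, not Spark's implementation: `bernoulli_split` is a hypothetical helper, and Python's random generator will not reproduce Spark's exact row selection for the same seed.

```python
import random


def bernoulli_split(rows, fraction, seed):
    """Keep each row with probability `fraction`; the rest form the test set."""
    rng = random.Random(seed)
    train = [row for row in rows if rng.random() < fraction]
    train_ids = {row["id"] for row in train}
    # Equivalent in spirit to subtract(): everything not sampled into train.
    test = [row for row in rows if row["id"] not in train_ids]
    return train, test


# Same total row count as the notebook after dropping null targets (2113 + 9251).
rows = [{"id": i} for i in range(11364)]
train, test = bernoulli_split(rows, fraction=0.8, seed=123)

ratio = len(train) / len(rows)
print(len(train), len(test), round(ratio, 4))
```

The two parts always add up to the full dataset, but the train share only approximates the requested fraction.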


# Spark NLP Model Set Up
First, I apply the Spark NLP `DocumentAssembler`, which is the entry point to a Spark NLP pipeline.
After that, I apply the Universal Sentence Encoder and then create a `ClassifierDLApproach`.
Finally, I put these stages into a pipeline and fit it on `train_data`.

1. **DocumentAssembler:**
   The `DocumentAssembler` is used to assemble the input text data into a format suitable for further processing. It's responsible for converting the input text data into a structured document format that can be fed into subsequent NLP components.

2. **UniversalSentenceEncoder:**
   The `UniversalSentenceEncoder` is a pre-trained sentence embedding model that converts text sentences into dense vector representations (embeddings). These embeddings capture the semantic meaning of the sentences and are useful for downstream tasks like classification.

3. **ClassifierDLApproach:**
   The `ClassifierDLApproach` is a deep-learning text classification approach. It takes the sentence embeddings generated by the `UniversalSentenceEncoder` and trains a classification model. In this case, the `target` column, which contains the labels (0, 1), is used as the label column for training.

   - `setInputCols(["sentence_embeddings"])`: Specifies that the input to the classifier model will be the sentence embeddings generated by the `UniversalSentenceEncoder`.
   - `setOutputCol("class")`: Sets the output column name for the classification results.
   - `setLabelColumn("target")`: Specifies the column containing the target labels for supervised training.
   - `setMaxEpochs(10)`: Sets the maximum number of training epochs.
   - `setEnableOutputLogs(True)`: Enables logging of training progress.
   - `setLr(0.004)`: Sets the learning rate for training.

4. **Pipeline:**
   A `Pipeline` is created to define the sequence of stages for processing the data. It includes the `DocumentAssembler`, `UniversalSentenceEncoder`, and `ClassifierDLApproach` stages.
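
The stages connect through column names: each stage reads the column written by the previous one (`text` → `document` → `sentence_embeddings` → `class`). The toy sketch below shows only that chaining idea in plain Python. `ToyStage` and `run_pipeline` are hypothetical stand-ins rather than Spark NLP classes, and the "embedding" and "classifier" are dummy functions with no predictive meaning.

```python
class ToyStage:
    """Hypothetical stand-in for a pipeline stage: reads one column, writes another."""

    def __init__(self, input_col, output_col, fn):
        self.input_col = input_col
        self.output_col = output_col
        self.fn = fn

    def transform(self, row):
        out = dict(row)  # leave the input row untouched
        out[self.output_col] = self.fn(row[self.input_col])
        return out


def run_pipeline(stages, row):
    """Apply the stages in order, as a pipeline does."""
    for stage in stages:
        row = stage.transform(row)
    return row


stages = [
    # Wrap raw text into a "document" structure.
    ToyStage("text", "document", lambda t: {"result": t.strip()}),
    # Dummy one-dimensional "embedding": just the text length.
    ToyStage("document", "sentence_embeddings", lambda d: [float(len(d["result"]))]),
    # Dummy "classifier": an arbitrary threshold on the dummy embedding.
    ToyStage("sentence_embeddings", "class", lambda e: "1" if e[0] > 20 else "0"),
]

row = run_pipeline(stages, {"text": " Arsonist sets cars ablaze at dealership "})
print(sorted(row))
```

If a stage's input column name does not match an earlier stage's output column, the chain breaks, which is why the `setInputCols` and `setOutputCol` values have to line up.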


In [96]:
# Entry point: wrap the raw "text" column into a Spark NLP document annotation
document = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

# Pretrained Universal Sentence Encoder: turns each document into an embedding
use = UniversalSentenceEncoder.pretrained()\
    .setInputCols(["document"])\
    .setOutputCol("sentence_embeddings")

# Trainable classifier on top of the sentence embeddings
classifierdl = ClassifierDLApproach()\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("class")\
    .setLabelColumn("target")\
    .setMaxEpochs(10)\
    .setEnableOutputLogs(True)\
    .setLr(0.004)

nlpPipeline = Pipeline(
    stages = [
        document,
        use,
        classifierdl
    ])

tfhub_use download started this may take some time.
Approximate size to download 923.7 MB
[OK!]


## Train Model

In [97]:
# Fit the pipeline on train_data
model = nlpPipeline.fit(train_data)

When we fit the pipeline, Spark NLP writes the training logs to the "annotator_logs" folder in our home directory.
Here is how to list the logs:

In [98]:
!cd ~/annotator_logs && ls -l

total 8
-rw-r--r-- 1 root root  789 Aug 23 18:49 ClassifierDLApproach_08885d9bab9e.log
-rw-r--r-- 1 root root 3954 Aug 23 18:36 ClassifierDLApproach_319350138664.log


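To check the training results, print the log file. The file name contains a run-specific ID, so it must match one of the names listed above; the name used below does not, so the command fails: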

In [99]:
!cat ~/annotator_logs/ClassifierDLApproach_e0b7218e9b57.log

cat: /root/annotator_logs/ClassifierDLApproach_e0b7218e9b57.log: No such file or directory
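
Because the ID in the log file name changes with every run, a hard-coded path goes stale easily. One hedged alternative is to pick the most recently modified log instead. In the sketch below, `latest_log` is a hypothetical helper, and the `ClassifierDLApproach_` prefix and `~/annotator_logs` location are taken from the directory listing above.

```python
from pathlib import Path


def latest_log(log_dir, prefix="ClassifierDLApproach_"):
    """Return the most recently modified matching log file, or None if there is none."""
    directory = Path(log_dir).expanduser()
    if not directory.is_dir():
        return None
    logs = sorted(directory.glob(f"{prefix}*.log"), key=lambda p: p.stat().st_mtime)
    return logs[-1] if logs else None


path = latest_log("~/annotator_logs")
if path is not None:
    print(path.name)
    print(path.read_text())
else:
    print("No ClassifierDLApproach log found")
```

This reads whichever log was written last, so it should be run right after `fit` to be sure it belongs to the current model.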


See the prediction results:

In [100]:
preds= model.transform(test_data)
preds.select("target", "text", "class.result").show(5, truncate=False)

+------+-------------------------------------------------------------------------------------------------------------------------------------+------+
|target|text                                                                                                                                 |result|
+------+-------------------------------------------------------------------------------------------------------------------------------------+------+
|0     |HE HAS A FUCKING HEART AND A LITTLE RAINBOW EMBROIDERED ONTO HIS PANTS IM GOING TO COMMIT ARSON https://t.co/hF30TdQGpy              |[0]   |
|1     |Violence, arson across West Bengal as strikers try to enforce bandh; 55 arrested in Kolkata - Times of India… https://t.co/bIMdUMDstT|[0]   |
|0     |When an eagle soars into the blazing sun, it is not just a show of strength but because he truly belongs at the top. You…            |[0]   |
|0     |“Life naturally involves conflicting interests; people have their own agendas, and they coll

## Prediction Report

In [101]:
df_pred= preds.select("target", "document", "class.result").toPandas()
df_pred["result"]= df_pred["result"].apply(lambda x: x[0])
print(classification_report(df_pred["target"], df_pred["result"]))
print(accuracy_score(df_pred["target"], df_pred["result"]))

              precision    recall  f1-score   support

           0       0.81      1.00      0.89      1868
           1       0.00      0.00      0.00       443

    accuracy                           0.81      2311
   macro avg       0.40      0.50      0.45      2311
weighted avg       0.65      0.81      0.72      2311

0.8083080917351796


  _warn_prf(average, modifier, msg_start, len(result))
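
The report above is consistent with a model that predicts class 0 for every test tweet: recall is 1.00 for class 0 and 0.00 for class 1, and the accuracy equals the share of class 0 in the test set. The sketch below recomputes the figures from the support counts alone under that assumption. `per_class_metrics` is a hypothetical helper, and undefined ratios are set to 0, which is how the report displays them.

```python
def per_class_metrics(tp, fp, fn):
    """Precision, recall and F1 from raw counts; undefined ratios become 0.0."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Support counts taken from the report above.
support_0, support_1 = 1868, 443
total = support_0 + support_1

# Assumption: every test tweet is predicted as class 0.
# Class 0: all 1868 true zeros are hits, all 443 ones are false positives.
p0, r0, f0 = per_class_metrics(tp=support_0, fp=support_1, fn=0)
# Class 1: never predicted, so all 443 ones are misses.
p1, r1, f1 = per_class_metrics(tp=0, fp=0, fn=support_1)

accuracy = support_0 / total
macro_f1 = (f0 + f1) / 2

print(round(p0, 2), round(r0, 2), round(f0, 2))
print(round(p1, 2), round(r1, 2), round(f1, 2))
print(accuracy, round(macro_f1, 2))
```

In other words, the 0.81 accuracy is what a majority-class baseline achieves on this imbalanced test set, so the macro-averaged scores give a more honest picture of the model than accuracy does.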
