## uHack Sentiment of Reviews

Dataset Source: https://www.kaggle.com/datasets/mohamedziauddin/mh-uhack-sentiments

#### Import Necessary Libraries

In [0]:
import pyspark

from pyspark.ml import Pipeline
import pyspark.sql.functions as F
from pyspark.sql.types import StringType, StructType, StructField, IntegerType, ArrayType, DoubleType
from pyspark.ml.feature import StringIndexer, IndexToString

import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *

#### Display Library Versions

In [0]:
print("Apache Spark version:".rjust(24), spark.version)
print("Spark NLP version:".rjust(24), sparknlp.version())

   Apache Spark version: 3.3.0
      Spark NLP version: 4.3.1


#### Function to Ingest Dataset

In [0]:
def ingest_data(file_location: str, 
                schema: StructType, 
                delimiter: str = ',' 
               ) -> pyspark.sql.dataframe.DataFrame:
    '''
    This function reads in the dataset that is passed to it
    and fits the schema that is passed in to the dataset.
    '''
    # The schema is supplied explicitly, so no schema inference is needed.
    dataset = spark.read \
        .option("header", "true") \
        .option("sep", delimiter) \
        .schema(schema) \
        .csv(file_location)
    
    return dataset

#### Function to Evaluate Multilabel Classification Models

In [0]:
def evaluate_multilabel_model(dataset: pyspark.sql.dataframe.DataFrame, 
                              metrics: list[str], 
                              model_name: str 
                             ) -> None:
    '''
    This function calculates & displays metrics for a multilabel 
    classification analysis.
    '''
    from pyspark.ml.evaluation import MultilabelClassificationEvaluator
    
    print("+---------------------------------------------+")
    print("|  " + model_name.center(41) + "  |")
    print("+---------------------------------------------+")
    print("|   %s  |  %s   |" % ("Metric".rjust(20), "Value".ljust(14)))
    print("+---------------------------------------------+")
    
    for x in metrics:
        evaluator = MultilabelClassificationEvaluator(labelCol="label",
                                                      predictionCol="prediction",
                                                      metricName=x)
        score = evaluator.evaluate(dataset)
        print("|   %s  |  %s   |" % (x.rjust(20), str(round(score, 6)).ljust(14)))
        print("+---------------------------------------------+")

#### Ingest Dataset

In [0]:
data_file = "/FileStore/tables/uHack/train.csv"

orig_schema = StructType([
         StructField("id", StringType(), True),
         StructField('text', StringType(), True),
         StructField('Components', IntegerType(), True),
         StructField('Delivery & Customer Support', IntegerType(), True),
         StructField('Design & Aesthetics', IntegerType(), True),
         StructField('Dimensions', IntegerType(), True),
         StructField('Features', IntegerType(), True),
         StructField('Functionality', IntegerType(), True),
         StructField('Installation', IntegerType(), True),
         StructField('Material', IntegerType(), True),
         StructField('Price', IntegerType(), True),
         StructField('Quality', IntegerType(), True),
         StructField('Usability', IntegerType(), True),
         StructField('Polarity', IntegerType(), True),
         ])

data = ingest_data(data_file, orig_schema)

data = data.drop("id")

data = data.na.drop(how='any')

data = data.dropDuplicates()

display(data)

text,Components,Delivery & Customer Support,Design & Aesthetics,Dimensions,Features,Functionality,Installation,Material,Price,Quality,Usability,Polarity
"I purchased this along with the copper crimp tool set. I needed to salvage/reuse the the water line valves on 1/2 in pex line. the tip does not fit into any 1/2 in valve or fitting. It does work with 1/2 in brass fittings though. I had to get creative and remove the rings with my ryobi multi tool to cut the rings off. Had to be careful not to damage the valve but it worked just fine. removed the crimp rings and was able to re use the water valves, since homedepot did not have the correct size I needed. Still a decent price and it does complete the tool set, will keep it for future projects.",0,0,0,1,0,0,0,0,1,0,0,0
"Works well with my 18 gauge Brad nailer. Easy to load, no jams.",0,0,0,0,0,1,0,0,0,0,1,1
"Biggest problem I had with this was that the charcoal finish is soluble with denatured alcohol.....same thing I use to remove latex paint...so the least little drip of paint when touching up the frames stayed on the screen...or I had to accept a bare spot in the charcoal finish. First time I did a big re-screening.....found I had to be very careful handling the screen, it gets creases very easily (but that might be normal for aluminum screen).Darkens the view more than the old screens, and more than I expected.",0,0,0,0,0,0,0,1,0,1,0,0
"""I like everything about it, great choice of spray patterns, it puts out a large volume of water out of my 1"""" pipes""",0,0,0,0,1,1,0,0,0,0,0,1
These Brad nails worked out great for what I needed it for.,0,0,0,0,0,1,0,0,0,0,0,1
"Took the guy 30 minutes to install new toilets. Love, Love them. My old toilet was 14 inches tall to rim. Really hard to get up from that distance. These are so much taller. Can't say enough good things about these toilets. Would recommend to any one.",0,0,0,0,0,0,0,0,0,1,0,1
Super easy to install and I am not a handy man by any means. It has only been a few days but no issues to this point.,0,0,0,0,0,0,1,0,0,0,0,1
I refinished 4 metal lawn chairs and my pillows fit perfectly!!!,0,0,0,1,0,0,0,0,0,0,0,1
Strong nails with great holding power,0,0,0,0,0,1,0,0,0,1,0,1
frame material is thin and flex's on long sides and screen is loose on sides,0,0,0,0,0,0,0,1,0,1,0,0


#### Convert label Columns into Single Label Column of ArrayType

In [0]:
label_cols = ["Components",
              "Delivery & Customer Support",
              "Design & Aesthetics",
              "Dimensions", 
              "Features",
              "Functionality", 
              "Installation",
              "Material",
              "Price",
              "Quality",
              "Usability",
              "Polarity"]

for x in label_cols:
    data = data.withColumn(x, F.when(F.col(x)==1, F.lit(x)).otherwise(F.concat(F.lit(x), F.lit("Not"))))

cols_to_make_array_type = [data[x] for x in label_cols]

data = data.withColumn("labels", F.array(cols_to_make_array_type).cast(ArrayType(StringType())))

data = data.drop(*label_cols)

display(data)

text,labels
"I purchased this along with the copper crimp tool set. I needed to salvage/reuse the the water line valves on 1/2 in pex line. the tip does not fit into any 1/2 in valve or fitting. It does work with 1/2 in brass fittings though. I had to get creative and remove the rings with my ryobi multi tool to cut the rings off. Had to be careful not to damage the valve but it worked just fine. removed the crimp rings and was able to re use the water valves, since homedepot did not have the correct size I needed. Still a decent price and it does complete the tool set, will keep it for future projects.","List(ComponentsNot, Delivery & Customer SupportNot, Design & AestheticsNot, Dimensions, FeaturesNot, FunctionalityNot, InstallationNot, MaterialNot, Price, QualityNot, UsabilityNot, PolarityNot)"
"Works well with my 18 gauge Brad nailer. Easy to load, no jams.","List(ComponentsNot, Delivery & Customer SupportNot, Design & AestheticsNot, DimensionsNot, FeaturesNot, Functionality, InstallationNot, MaterialNot, PriceNot, QualityNot, Usability, Polarity)"
"Biggest problem I had with this was that the charcoal finish is soluble with denatured alcohol.....same thing I use to remove latex paint...so the least little drip of paint when touching up the frames stayed on the screen...or I had to accept a bare spot in the charcoal finish. First time I did a big re-screening.....found I had to be very careful handling the screen, it gets creases very easily (but that might be normal for aluminum screen).Darkens the view more than the old screens, and more than I expected.","List(ComponentsNot, Delivery & Customer SupportNot, Design & AestheticsNot, DimensionsNot, FeaturesNot, FunctionalityNot, InstallationNot, Material, PriceNot, Quality, UsabilityNot, PolarityNot)"
"""I like everything about it, great choice of spray patterns, it puts out a large volume of water out of my 1"""" pipes""","List(ComponentsNot, Delivery & Customer SupportNot, Design & AestheticsNot, DimensionsNot, Features, Functionality, InstallationNot, MaterialNot, PriceNot, QualityNot, UsabilityNot, Polarity)"
These Brad nails worked out great for what I needed it for.,"List(ComponentsNot, Delivery & Customer SupportNot, Design & AestheticsNot, DimensionsNot, FeaturesNot, Functionality, InstallationNot, MaterialNot, PriceNot, QualityNot, UsabilityNot, Polarity)"
"Took the guy 30 minutes to install new toilets. Love, Love them. My old toilet was 14 inches tall to rim. Really hard to get up from that distance. These are so much taller. Can't say enough good things about these toilets. Would recommend to any one.","List(ComponentsNot, Delivery & Customer SupportNot, Design & AestheticsNot, DimensionsNot, FeaturesNot, FunctionalityNot, InstallationNot, MaterialNot, PriceNot, Quality, UsabilityNot, Polarity)"
Super easy to install and I am not a handy man by any means. It has only been a few days but no issues to this point.,"List(ComponentsNot, Delivery & Customer SupportNot, Design & AestheticsNot, DimensionsNot, FeaturesNot, FunctionalityNot, Installation, MaterialNot, PriceNot, QualityNot, UsabilityNot, Polarity)"
I refinished 4 metal lawn chairs and my pillows fit perfectly!!!,"List(ComponentsNot, Delivery & Customer SupportNot, Design & AestheticsNot, Dimensions, FeaturesNot, FunctionalityNot, InstallationNot, MaterialNot, PriceNot, QualityNot, UsabilityNot, Polarity)"
Strong nails with great holding power,"List(ComponentsNot, Delivery & Customer SupportNot, Design & AestheticsNot, DimensionsNot, FeaturesNot, Functionality, InstallationNot, MaterialNot, PriceNot, Quality, UsabilityNot, Polarity)"
frame material is thin and flex's on long sides and screen is loose on sides,"List(ComponentsNot, Delivery & Customer SupportNot, Design & AestheticsNot, DimensionsNot, FeaturesNot, FunctionalityNot, InstallationNot, Material, PriceNot, Quality, UsabilityNot, PolarityNot)"
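
The loop above rewrites each 0/1 column as a string, so every review ends up with one entry per category: the category name when the flag is 1, and the name with a `Not` suffix otherwise. A plain-Python sketch of the same rule for a single row (the `encode_row` helper and the two-column example row are illustrative assumptions, not part of the notebook):

```python
def encode_row(row: dict, label_cols: list) -> list:
    """Mirror the F.when(...).otherwise(...) rule for one row.

    Hypothetical helper for illustration only; the notebook applies
    the same rule with Spark column expressions.
    """
    return [name if row[name] == 1 else name + "Not" for name in label_cols]


example_cols = ["Price", "Quality"]
example_row = {"Price": 1, "Quality": 0}
print(encode_row(example_row, example_cols))  # ['Price', 'QualityNot']
```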


#### Split Dataset into Training & Evaluation Datasets

In [0]:
train_ds, test_ds = data.randomSplit([0.80, 0.20], seed=42)

train_ds = train_ds.persist()
test_ds = test_ds.persist()

print(f"There are {train_ds.count()} samples in the training dataset.")
print(f"There are {test_ds.count()} samples in the testing dataset.")

There are 4794 samples in the training dataset.
There are 1222 samples in the testing dataset.


#### Define Pipeline Stages

In [0]:
# document assembler
doc = DocumentAssembler() \
        .setInputCol("text") \
        .setOutputCol("document") \
        .setCleanupMode("shrink")

# Universal Sentence Encoder
use = UniversalSentenceEncoder.pretrained() \
        .setInputCols(["document"]) \
        .setOutputCol("sentences")

# Multi-label deep learning classifier
ml_clf = MultiClassifierDLApproach() \
        .setInputCols("sentences") \
        .setOutputCol("class") \
        .setLabelColumn("labels") \
        .setBatchSize(16) \
        .setMaxEpochs(9) \
        .setLr(1e-3) \
        .setThreshold(0.5) \
        .setShufflePerEpoch(False) \
        .setEnableOutputLogs(True)

tfhub_use download started this may take some time.
Approximate size to download 923.7 MB
[ | ][ / ][ — ][ \ ][ | ][ / ][ — ][ \ ][ | ][ / ][ — ][ \ ][ | ][ / ][ — ][ \ ][ | ][ / ][ — ][ \ ][ | ][ / ][ — ][ \ ][ | ][ / ][ — ][ \ ][ | ][ / ][ — ][ \ ][ | ][ / ][ — ][ \ ][ | ][ / ][ — ][ \ ][ | ][ / ][ — ][ \ ][ | ][ / ][ — ][ \ ][ | ][ / ][ — ][ \ ][ | ][ / ][ — ][ \ ][ | ][ / ][ — ][ \ ][ | ][ / ][ — ][ \ ][ | ][ / ][ — ][ \ ][ | ][ / ][ — ][ \ ][ | ][ / ][ — ][ \ ][ | ][ / ][ — ][ \ ][ | ][ / ][ — ][ \ ][ | ][OK!]


#### Build Pipeline

In [0]:
ml_pipe = Pipeline().setStages([
    doc,
    use,
    ml_clf
])

#### Fit/Train Model

In [0]:
ml_model = ml_pipe.fit(train_ds)

#### Generate Predictions Using Testing Dataset

In [0]:
predictions = ml_model.transform(test_ds)

#### Prepare Predictions for Metrics Evaluation (Part 1)

In [0]:
preds = predictions.select(F.col('labels').alias("label"),
                                F.col('class.result').cast(ArrayType(StringType())).alias("prediction"))

train_ds = train_ds.unpersist()
test_ds = test_ds.unpersist()

preds = preds.persist()

preds = preds.withColumn("label", F.array_sort(F.col("label")).cast(ArrayType(StringType()))) \
            .withColumn("prediction", F.array_sort(F.col("prediction").cast(ArrayType(StringType()))))

display(preds)

label,prediction
"List(ComponentsNot, Delivery & Customer SupportNot, Design & AestheticsNot, DimensionsNot, FeaturesNot, FunctionalityNot, InstallationNot, Material, PolarityNot, PriceNot, Quality, UsabilityNot)","List(ComponentsNot, Delivery & Customer SupportNot, Design & AestheticsNot, DimensionsNot, FeaturesNot, FunctionalityNot, InstallationNot, MaterialNot, PolarityNot, PriceNot, Quality, UsabilityNot)"
"List(ComponentsNot, Delivery & Customer SupportNot, Design & Aesthetics, DimensionsNot, FeaturesNot, FunctionalityNot, InstallationNot, MaterialNot, Polarity, PriceNot, Quality, Usability)","List(ComponentsNot, Delivery & Customer SupportNot, Design & AestheticsNot, DimensionsNot, FeaturesNot, FunctionalityNot, InstallationNot, MaterialNot, Polarity, PriceNot, QualityNot, UsabilityNot)"
"List(ComponentsNot, Delivery & Customer SupportNot, Design & AestheticsNot, DimensionsNot, FeaturesNot, Functionality, InstallationNot, MaterialNot, Polarity, PriceNot, Quality, UsabilityNot)","List(ComponentsNot, Delivery & Customer SupportNot, Design & AestheticsNot, DimensionsNot, FeaturesNot, FunctionalityNot, Installation, MaterialNot, Polarity, PriceNot, Quality, UsabilityNot)"
"List(ComponentsNot, Delivery & Customer SupportNot, Design & AestheticsNot, DimensionsNot, FeaturesNot, FunctionalityNot, InstallationNot, MaterialNot, Polarity, PriceNot, Quality, UsabilityNot)","List(ComponentsNot, Delivery & Customer SupportNot, Design & AestheticsNot, DimensionsNot, FeaturesNot, FunctionalityNot, InstallationNot, MaterialNot, Polarity, PriceNot, Quality, UsabilityNot)"
"List(ComponentsNot, Delivery & Customer SupportNot, Design & AestheticsNot, DimensionsNot, FeaturesNot, FunctionalityNot, InstallationNot, MaterialNot, Polarity, Price, QualityNot, UsabilityNot)","List(ComponentsNot, Delivery & Customer SupportNot, Design & AestheticsNot, DimensionsNot, FeaturesNot, FunctionalityNot, InstallationNot, MaterialNot, Polarity, Price, QualityNot, UsabilityNot)"
"List(ComponentsNot, Delivery & Customer SupportNot, Design & AestheticsNot, DimensionsNot, FeaturesNot, FunctionalityNot, Installation, MaterialNot, Polarity, PriceNot, QualityNot, UsabilityNot)","List(ComponentsNot, Delivery & Customer SupportNot, Design & AestheticsNot, DimensionsNot, FeaturesNot, FunctionalityNot, Installation, MaterialNot, Polarity, PriceNot, QualityNot, UsabilityNot)"
"List(ComponentsNot, Delivery & Customer SupportNot, Design & AestheticsNot, DimensionsNot, Features, Functionality, InstallationNot, MaterialNot, Polarity, PriceNot, QualityNot, UsabilityNot)","List(ComponentsNot, Delivery & Customer SupportNot, Design & AestheticsNot, Dimensions, FeaturesNot, FunctionalityNot, InstallationNot, MaterialNot, Polarity, PriceNot, QualityNot, Usability)"
"List(ComponentsNot, Delivery & Customer SupportNot, Design & AestheticsNot, DimensionsNot, FeaturesNot, FunctionalityNot, InstallationNot, MaterialNot, Polarity, Price, QualityNot, UsabilityNot)","List(ComponentsNot, Delivery & Customer SupportNot, Design & AestheticsNot, DimensionsNot, FeaturesNot, FunctionalityNot, InstallationNot, MaterialNot, Polarity, Price, QualityNot, UsabilityNot)"
"List(ComponentsNot, Delivery & Customer SupportNot, Design & Aesthetics, DimensionsNot, FeaturesNot, FunctionalityNot, InstallationNot, MaterialNot, Polarity, PriceNot, QualityNot, UsabilityNot)","List(ComponentsNot, Delivery & Customer SupportNot, Design & Aesthetics, DimensionsNot, FeaturesNot, FunctionalityNot, InstallationNot, MaterialNot, Polarity, PriceNot, QualityNot, UsabilityNot)"
"List(ComponentsNot, Delivery & Customer Support, Design & AestheticsNot, DimensionsNot, FeaturesNot, FunctionalityNot, InstallationNot, MaterialNot, Polarity, Price, Quality, UsabilityNot)","List(ComponentsNot, Delivery & Customer Support, Design & AestheticsNot, DimensionsNot, FeaturesNot, FunctionalityNot, InstallationNot, MaterialNot, Polarity, Price, Quality, UsabilityNot)"


#### Prepare Predictions for Metrics Evaluation (Part 2)

In [0]:
### Dictionary to convert the ones
convert_to_ones = {x: 1 for x in label_cols}
print(convert_to_ones)

### Dictionary to convert the zeros
not_label_cols = [x + "Not" for x in label_cols]

convert_to_zeros = {x: 0 for x in not_label_cols}
print(convert_to_zeros)

{'Components': 1, 'Delivery & Customer Support': 1, 'Design & Aesthetics': 1, 'Dimensions': 1, 'Features': 1, 'Functionality': 1, 'Installation': 1, 'Material': 1, 'Price': 1, 'Quality': 1, 'Usability': 1, 'Polarity': 1}
{'ComponentsNot': 0, 'Delivery & Customer SupportNot': 0, 'Design & AestheticsNot': 0, 'DimensionsNot': 0, 'FeaturesNot': 0, 'FunctionalityNot': 0, 'InstallationNot': 0, 'MaterialNot': 0, 'PriceNot': 0, 'QualityNot': 0, 'UsabilityNot': 0, 'PolarityNot': 0}


#### Prepare Predictions for Metrics Evaluation (Part 3)

In [0]:
# The two dictionaries together map every label string to 0 or 1:
# names ending in "Not" become 0, all other names become 1.
label_to_flag = {**convert_to_zeros, **convert_to_ones}

def replace_with_flags(x):
    return [label_to_flag.get(i, i) for i in x]

# No return type is given, so Spark stores the result as a string
# such as "[0, 0, 1]"; Part 4 splits that string back into an array.
flag_converter = F.udf(replace_with_flags)

### Apply the same conversion to the 'label' and 'prediction' columns
preds = preds.withColumn("label", flag_converter(F.col("label"))) \
            .withColumn("prediction", flag_converter(F.col("prediction")))

display(preds)

label,prediction
"[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0]"
"[0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1]","[0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0]"
"[0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0]","[0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0]"
"[0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0]","[0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0]"
"[0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0]"
"[0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0]","[0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0]"
"[0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0]","[0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1]"
"[0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0]"
"[0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0]","[0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0]"
"[0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0]","[0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0]"
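
The conversion above relies on `array_sort` putting the label and prediction arrays in the same order, and on each array holding exactly one entry per category. A minimal sketch of an alternative that looks each category up by name, so neither ordering nor a missing entry changes the positions (plain Python; `to_indicator` and the shortened `categories` list are assumptions for illustration, and treating a category that appears in neither form as 0 is a choice made here, not something the notebook specifies):

```python
def to_indicator(labels, categories):
    """Return one 0.0/1.0 flag per category, in the order of `categories`.

    Hypothetical helper: a category counts as 1.0 only when its plain
    name (without the "Not" suffix) appears among the labels.
    """
    present = set(labels)
    return [1.0 if name in present else 0.0 for name in categories]


categories = ["Material", "Price", "Quality"]  # shortened list for the example
print(to_indicator(["Quality", "MaterialNot", "PriceNot"], categories))  # [0.0, 0.0, 1.0]
print(to_indicator(["Price"], categories))                               # [0.0, 1.0, 0.0]
```

Wrapped in `F.udf` with an `ArrayType(DoubleType())` return type, a function along these lines would produce a numeric array directly, without the round trip through a string.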


#### Prepare Predictions for Metrics Evaluation (Part 4)

In [0]:
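# Strip the surrounding brackets before splitting, so the first and last
# entries ("[0" and " 0]") are not turned into nulls by the cast below.
preds = preds.withColumn("label", F.split(F.regexp_replace(F.col("label"), r"[\[\]]", ""), ","))\
            .withColumn("prediction", F.split(F.regexp_replace(F.col("prediction"), r"[\[\]]", ""), ","))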

preds = preds.withColumn("label", F.col("label").cast(ArrayType(DoubleType())))\
            .withColumn("prediction", F.col("prediction").cast(ArrayType(DoubleType())))
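
The examples in Spark's documentation for `MultilabelClassificationEvaluator` pass, for each row, an array of the ids of the labels that apply (for instance `[0.0, 2.0]`), rather than one 0/1 flag per category. If that format is wanted, the indicator vectors can be turned into id arrays first. A plain-Python sketch, assuming the position in the vector is used as the label id (`indicator_to_ids` is a hypothetical helper, not part of the notebook):

```python
def indicator_to_ids(vector):
    """Return the positions of the 1-flags as floats.

    Assumption for illustration: label id == position in the vector.
    """
    return [float(i) for i, flag in enumerate(vector) if flag == 1]


# One label vector from the sample output above
print(indicator_to_ids([0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1]))  # [2.0, 8.0, 10.0, 11.0]
```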

#### Metrics Evaluation

In [0]:
metrics_to_eval = ["accuracy", "f1Measure", 
                    "precision", "recall", 
                    "microPrecision", "microRecall", 
                    "microF1Measure", "subsetAccuracy", 
                    "hammingLoss"]

evaluate_multilabel_model(preds, 
                          metrics_to_eval, 
                          "Multi-Label of uHack Review Sentiments")

+---------------------------------------------+
|    Multi-Label of uHack Review Sentiments   |
+---------------------------------------------+
|                 Metric  |  Value            |
+---------------------------------------------+
|               accuracy  |  0.912601         |
+---------------------------------------------+
|              f1Measure  |  0.951605         |
+---------------------------------------------+
|              precision  |  0.951306         |
+---------------------------------------------+
|                 recall  |  0.952059         |
+---------------------------------------------+
|         microPrecision  |  0.951151         |
+---------------------------------------------+
|            microRecall  |  0.952059         |
+---------------------------------------------+
|         microF1Measure  |  0.951605         |
+---------------------------------------------+
|         subsetAccuracy  |  0.388707         |
+---------------------------------------------+

### Notes & Other Takeaways From This Project
****
- I was hoping for better results. The F1 measure here is higher than with DistilBERT via the HuggingFace Trainer API (0.9516 vs. 0.8697), but the subset accuracy is lower (0.3887 vs. 0.5787).
****
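
The gap between the two scores is easier to see on a few rows. Subset accuracy counts a review as correct only when all twelve flags match, while micro-averaged F1 is computed from individual flags, so a review with one wrong flag still contributes its correct ones. A plain-Python sketch using three label/prediction pairs taken from the sample output shown earlier (the functions follow the usual textbook definitions over 0/1 vectors; they are an illustration, not Spark's implementation, and will not reproduce the table above):

```python
# Three (label, prediction) pairs copied from the sample output
y_true = [
    [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0],
    [0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0],
]
y_pred = [
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0],
    [0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1],
]


def subset_accuracy(y_true, y_pred):
    """Fraction of rows whose whole vector is predicted exactly."""
    exact = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return exact / len(y_true)


def hamming_loss(y_true, y_pred):
    """Fraction of individual flags that are wrong."""
    wrong = sum(t != p for tr, pr in zip(y_true, y_pred) for t, p in zip(tr, pr))
    total = sum(len(tr) for tr in y_true)
    return wrong / total


def micro_f1(y_true, y_pred):
    """F1 over all flags pooled together (positives only)."""
    tp = fp = fn = 0
    for tr, pr in zip(y_true, y_pred):
        for t, p in zip(tr, pr):
            if t == 1 and p == 1:
                tp += 1
            elif t == 0 and p == 1:
                fp += 1
            elif t == 1 and p == 0:
                fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


print(round(subset_accuracy(y_true, y_pred), 4))  # 0.3333
print(round(hamming_loss(y_true, y_pred), 4))     # 0.1389
print(round(micro_f1(y_true, y_pred), 4))         # 0.6154
```

Only one of the three rows matches exactly, so subset accuracy is 1/3, even though 31 of the 36 individual flags are correct.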