In [1]:
!pip install pyspark==3.4.1
!pip install spark-nlp==5.2.3

Collecting pyspark==3.4.1
  Downloading pyspark-3.4.1.tar.gz (310.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 310.8/310.8 MB 4.7 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.4.1-py2.py3-none-any.whl size=311285391 sha256=796d6a2f041d0fb4b11f4682b8fc24859444b31bf320b6d46c9d006156bf5fb2
  Stored in directory: /root/.cache/pip/wheels/e9/b4/d8/38accc42606f6675165423e9f0236f8e825f6b6b6048d6743e
Successfully built pyspark
Installing collected packages: pyspark
  Attempting uninstall: pyspark
    Found existing installation: pyspark 3.5.1
    Uninstalling pyspark-3.5.1:
      Successfully uninstalled pyspark-3.5.1
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the fo

In [2]:
import sparknlp
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentimentDLModel, UniversalSentenceEncoder

# Start a Spark session with Spark NLP loaded
spark = sparknlp.start()

# Sample dataset (50 sample sentences)
sentences = [
    "I love this movie!", "This was the worst experience.", "Pretty decent overall.",
    "Absolutely fantastic!", "I'm not sure how I feel.", "Worst purchase ever.",
    "Great value for the money.", "It was okay, not great.", "Terrible, just terrible.",
    "Super fun and engaging!", "Would not recommend it.", "Kind of boring.",
    "Loved every minute!", "It was a disaster.", "Highly recommend!",
    "Too expensive for what you get.", "Amazing support team!", "Horrible food.",
    "I'll definitely buy it again.", "Meh, nothing special.", "Exceeded my expectations!",
    "Not worth the hype.", "Incredible storytelling.", "Never again.",
    "Pretty enjoyable!", "Worst customer service.", "Delightful and fresh.",
    "Disappointed.", "Can't wait to try it again!", "A total waste of time.",
    "Superb work!", "Very underwhelming.", "Loved the packaging.",
    "Felt very rushed.", "Unbelievably good.", "Didn't like it at all.",
    "Perfect execution!", "Mediocre at best.", "Simply amazing.",
    "It made my day!", "Forgettable.", "Super smooth experience.",
    "I regret buying it.", "Truly inspiring.", "Boring and repetitive.",
    "Five stars!", "Nothing new.", "Highly entertaining.", "Wasted potential.", "It was alright."
]

# Create DataFrame
df = spark.createDataFrame([(s,) for s in sentences], ["text"])

# Spark NLP pipeline
document_assembler = DocumentAssembler().setInputCol("text").setOutputCol("document")

# Add sentence embeddings
use = UniversalSentenceEncoder.pretrained("tfhub_use", "en") \
    .setInputCols(["document"]) \
    .setOutputCol("sentence_embeddings")

# Use pretrained sentiment model (Twitter-based)
sentiment_model = SentimentDLModel.pretrained("sentimentdl_use_twitter", "en")\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("sentiment")

# Create pipeline
pipeline = Pipeline(stages=[document_assembler, use, sentiment_model])

# Run the pipeline
result = pipeline.fit(df).transform(df)

# Show results
result.select("text", "sentiment.result").show(truncate=False)

tfhub_use download started this may take some time.
Approximate size to download 923.7 MB
[OK!]
sentimentdl_use_twitter download started this may take some time.
Approximate size to download 11.4 MB
[OK!]
+-------------------------------+----------+
|text                           |result    |
+-------------------------------+----------+
|I love this movie!             |[positive]|
|This was the worst experience. |[negative]|
|Pretty decent overall.         |[positive]|
|Absolutely fantastic!          |[positive]|
|I'm not sure how I feel.       |[negative]|
|Worst purchase ever.           |[negative]|
|Great value for the money.     |[positive]|
|It was okay, not great.        |[neutral] |
|Terrible, just terrible.       |[negative]|
|Super fun and engaging!        |[positive]|
|Would not recommend it.        |[negative]|
|Kind of boring.                |[negative]|
|Loved every minute!            |[positive]|
|It was a disaster.             |[negative]|
|Highly recommend!            

**Comparison of the Three Approaches**
# Traditional Machine Learning Model

Workflow:

- Started Spark NLP and loaded 50 sample sentences.

- Converted text to documents using DocumentAssembler.

- Generated sentence embeddings with a pretrained Universal Sentence Encoder (tfhub_use).

- Classified sentiment using a pretrained Twitter-based SentimentDLModel (sentimentdl_use_twitter).

- Output each sentence with its predicted sentiment (positive, negative, or neutral).

Essentially, it's sentiment analysis on short text using pretrained NLP models.

# HuggingFace LLM (GPT-2) Approach

This approach runs sentiment analysis inside PySpark by calling a Hugging Face model from a UDF (User Defined Function).

Workflow:

- Started a Spark session.

- Loaded sample text data into a Spark DataFrame.

- Defined a UDF that loads the Hugging Face sentiment-analysis pipeline once (lazy loading) and predicts sentiment for each row.

- Applied the UDF to the sentence column to create a predicted_sentiment column.

- Displayed the results directly in Spark.

This lets Spark distribute Hugging Face sentiment predictions across the text data without converting it to Pandas.
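The lazy-loading UDF described in the workflow might look roughly like the sketch below. It is not the notebook's actual cell: the helper names (`get_classifier`, `predict_sentiment`, `add_predicted_sentiment`) are hypothetical, and it assumes the default Hugging Face `sentiment-analysis` pipeline, which returns a list of `{"label", "score"}` dicts. The heavy imports sit inside the functions so the model is only loaded where the UDF actually runs.

```python
from typing import Callable, Optional

# Cached model, filled in on first use (hypothetical module-level cache).
_classifier: Optional[Callable] = None


def get_classifier() -> Callable:
    """Load the Hugging Face pipeline on first call and reuse it afterwards."""
    global _classifier
    if _classifier is None:
        from transformers import pipeline  # imported lazily, where the UDF runs
        _classifier = pipeline("sentiment-analysis")
    return _classifier


def predict_sentiment(text: Optional[str],
                      classifier: Optional[Callable] = None) -> Optional[str]:
    """Return a lowercase label for one sentence, or None for empty input."""
    if text is None or not text.strip():
        return None
    clf = classifier or get_classifier()
    return clf(text)[0]["label"].lower()


def add_predicted_sentiment(df, text_col: str = "sentence"):
    """Attach a predicted_sentiment column to a Spark DataFrame."""
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import StringType

    sentiment_udf = udf(predict_sentiment, StringType())
    return df.withColumn("predicted_sentiment", sentiment_udf(col(text_col)))
```

With a DataFrame `df` that has a `sentence` column, `add_predicted_sentiment(df).show(truncate=False)` would display the predictions. The optional `classifier` argument exists only so the row-level logic can be exercised with a stub, without downloading a model.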



# Spark NLP Sentiment Analysis with RoBERTa Model

A sentiment analysis evaluation workflow combining PySpark, Hugging Face Transformers, and scikit-learn:

- Created sample labeled sentences (positive, negative, neutral).

- Started a Spark session and loaded the data into a Spark DataFrame.

- Converted the DataFrame to Pandas for Hugging Face inference.

- Loaded a pretrained DistilBERT sentiment model (RoBERTa model).

- Predicted sentiment for each sentence and mapped it to positive, negative, or neutral.

- Evaluated performance using a classification report and accuracy score.

Classification Report:
              precision    recall  f1-score   support

    negative      0.700     1.000     0.824         7
     neutral      0.000     0.000     0.000         4
    positive      0.900     1.000     0.947         9

    accuracy                          0.800        20
   macro avg      0.533     0.667     0.590        20
weighted avg      0.650     0.800     0.715        20


Accuracy Score: 0.8
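
The summary rows of the report can be reproduced from the three per-class rows, which helps when reading it: the macro average weights every class equally, so the all-zero neutral row pulls it down sharply, while the weighted average scales each class by its support. The sketch below only redoes that arithmetic with the numbers printed above; it does not rerun the model, and the dictionary layout is an illustrative choice.

```python
# Per-class rows copied from the classification report:
# label -> (precision, recall, f1, support)
report = {
    "negative": (0.700, 1.000, 0.824, 7),
    "neutral":  (0.000, 0.000, 0.000, 4),
    "positive": (0.900, 1.000, 0.947, 9),
}

total_support = sum(row[3] for row in report.values())


def macro_avg(i: int) -> float:
    """Unweighted mean of column i across classes."""
    return sum(row[i] for row in report.values()) / len(report)


def weighted_avg(i: int) -> float:
    """Mean of column i weighted by each class's support."""
    return sum(row[i] * row[3] for row in report.values()) / total_support


# recall * support is the number of correctly predicted rows in each class.
accuracy = sum(row[1] * row[3] for row in report.values()) / total_support

print(f"macro precision    {macro_avg(0):.3f}")     # 0.533
print(f"weighted precision {weighted_avg(0):.3f}")  # 0.650
print(f"accuracy           {accuracy:.3f}")         # 0.800
```

The 0.8 accuracy therefore comes entirely from the negative and positive classes; none of the four neutral sentences was classified correctly.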

# Reflection on these approaches

The first approach (Spark NLP) feels like the "traditional Spark ML" way: scalable and self-contained, but reliant on older or domain-specific models. It's the most cluster-friendly but possibly the least accurate compared to modern Transformer-based approaches.

The second approach (Hugging Face in Pandas) is great for experiments and small datasets because it gives you the latest model power quickly, but you lose Spark's scale. It's the most "ML research notebook" friendly.

The third approach (Hugging Face RoBERTa in a Spark UDF) is the middle ground: it keeps Spark's scale and uses cutting-edge models, but needs careful optimization. It's the most production-ready if you want Transformer accuracy and Spark's parallelism.

If I had to pick for large-scale production sentiment analysis:

- I would choose RoBERTa for accuracy and scale.

- If speed and simplicity matter more than cutting-edge accuracy, the traditional ML approach with Spark NLP is cleaner.

- The Hugging Face model is best for prototyping and evaluation before scaling up.
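
One common reading of the "careful optimization" a Transformer-in-Spark-UDF setup needs is batching: scoring many sentences per model call instead of one per row. The sketch below illustrates that idea with a pandas UDF. It is an assumption about how such an optimization could look, not something the notebook ran; `predict_batch`, `make_batched_udf`, and `load_classifier` are hypothetical names, the classifier is assumed to accept a list of strings and return one `{"label": ...}` dict per input (as Hugging Face pipelines do), and null inputs are assumed to have been filtered out beforehand.

```python
from typing import Callable, Iterable, List


def predict_batch(texts: Iterable[str], classifier: Callable,
                  batch_size: int = 32) -> List[str]:
    """Score texts in chunks so the model sees many sentences per call."""
    texts = list(texts)
    labels: List[str] = []
    for start in range(0, len(texts), batch_size):
        chunk = texts[start:start + batch_size]
        # The classifier returns one dict per input sentence.
        labels.extend(out["label"].lower() for out in classifier(chunk))
    return labels


def make_batched_udf(load_classifier: Callable[[], Callable]):
    """Build a pandas UDF; load_classifier is called lazily inside the UDF."""
    import pandas as pd
    from pyspark.sql.functions import pandas_udf
    from pyspark.sql.types import StringType

    cache = {}  # holds the model after its first use

    @pandas_udf(StringType())
    def sentiment_udf(batch: pd.Series) -> pd.Series:
        if "clf" not in cache:
            cache["clf"] = load_classifier()
        labels = predict_batch(batch.tolist(), cache["clf"])
        return pd.Series(labels, index=batch.index)

    return sentiment_udf
```

A call such as `df.withColumn("predicted_sentiment", make_batched_udf(loader)("sentence"))`, where `loader` returns a Hugging Face pipeline, would then apply it column-wise. Whether this is actually faster depends on the model, batch size, and cluster setup, so it is worth measuring before adopting.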




