# **Spam Detection**

## Spam Message Classification with Spark NLP

In this notebook, you'll build an end-to-end **SMS spam detection pipeline** using existing pretrained models with [Spark NLP](https://sparknlp.org/) and [Apache Spark](https://spark.apache.org/).

You’ll learn how to:

- Set up a Spark NLP pipeline in Python
- Use pretrained models (Universal Sentence Encoder and a spam classifier)
- Build and run a text classification workflow
- Visualize predictions made on sample SMS texts

By the end, you’ll understand how to integrate Spark NLP into scalable NLP projects.

### Compatibility

| Platform                     | Compatible | Recommended | Notes                                                                                                             |
| ---------------------------- | ---------- | ----------- | ----------------------------------------------------------------------------------------------------------------- |
| **Local (e.g., M1 MacBook)** | ✅ Yes     | ✅ Yes      | -                                                                                                                 |
| **Google Colab**             | ✅ Yes     | ✅ Yes      | -                                                                                                                 |
| **Midway3 Login Node**       | ✅ Yes     | ❌ No       | It is generally not recommended to run Spark jobs on the login nodes.                                             |
| **Midway3 Compute Node**     | ✅ Yes     | ✅ Yes      | Use with `sinteractive`, `scode` or [Open OnDemand](https://midway3-ondemand.rcc.uchicago.edu/) (Jupyter/VSCode). |

### Credits

Adapted from [JohnSnowLabs Spark NLP SMS Spam Classifier Example](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/CLASSIFICATION_EN_SPAM.ipynb)


## 1. Environment Setup


This notebook was tested with Python 3.12.11.

If you haven't installed the required packages yet, **run the following command on the login node**.


In [1]:
# Install PySpark and Spark NLP
# On Midway, run this command in the activated Python environment on rcc.midway3, not from this cell
%pip install pyspark==3.5.6 spark-nlp==6.0.5 pandas numpy matplotlib seaborn

Collecting pyspark==3.5.6
  Downloading pyspark-3.5.6.tar.gz (317.4 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 317.4/317.4 MB 4.2 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Collecting spark-nlp==6.0.5
  Downloading spark_nlp-6.0.5-py2.py3-none-any.whl.metadata (19 kB)
Downloading spark_nlp-6.0.5-py2.py3-none-any.whl (718 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 718.9/718.9 kB 24.7 MB/s eta 0:00:00
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.5.6-py2.py3-none-any.whl size=317895798 sha256=d241445b456fe54367ac1c48857d25ae25dae1b00beb5e8710ff3399c055b6e3
  Stored in directory: /root/.cache/pip/wheels/64/62/f3/ec15656ea4ada0523cae62a1827fe7beb55d3c8c87174aad4a
Successfully built pyspark
Installing collected packages: spark-nlp, pyspark
  Attempting u
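
Because the install command pins exact versions, it can help to confirm that the active environment actually has them before starting Spark. The following is a minimal sketch using only the standard library; `PINNED` and `version_mismatches` are illustrative names, not part of Spark NLP.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the install command above
PINNED = {"pyspark": "3.5.6", "spark-nlp": "6.0.5"}


def version_mismatches(pinned):
    """Return {package: installed_version_or_None} for packages that differ from their pin."""
    mismatches = {}
    for package, wanted in pinned.items():
        try:
            installed = version(package)
        except PackageNotFoundError:
            installed = None  # package is not installed in this environment
        if installed != wanted:
            mismatches[package] = installed
    return mismatches


print(version_mismatches(PINNED) or "All pinned packages match")
```

An empty result means the environment matches the pins; a value of `None` means the package is missing.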

In [2]:
import os
import json
import pandas as pd
import numpy as np

import sparknlp
import pyspark.sql.functions as F

from pyspark.ml import Pipeline
from pyspark.sql import SparkSession
from sparknlp.annotator import *
from sparknlp.base import *
from sparknlp.pretrained import PretrainedPipeline
from pyspark.sql.types import StringType, IntegerType

## 2. Start Spark Session


In [3]:
spark = sparknlp.start()

print("Spark NLP version", sparknlp.version())
print("Apache Spark version:", spark.version)

spark

Spark NLP version 6.0.5
Apache Spark version: 3.5.6


## 3. Pretrained Models


### Spark NLP Models Hub

Spark NLP provides a **Models Hub** containing thousands of pretrained models and pipelines:

- Explore the hub: [sparknlp.org/models](https://sparknlp.org/models)

These models are usually not the latest state of the art (which is more often found on Hugging Face), but they are **production-ready** and can be used as-is or fine-tuned for specific tasks.

### Models used in this notebook

- **Universal Sentence Encoder**: "tfhub_use"  
  A bridge to TensorFlow Hub's USE model  
  Model page: [sparknlp.org/2020/04/17/tfhub_use.html](https://sparknlp.org/2020/04/17/tfhub_use.html)

- **Spam Classifier**: "classifierdl_use_spam"  
  Classifies SMS as `spam` or `ham`  
  Model page: [sparknlp.org/2021/01/09/classifierdl_use_spam_en.html](https://sparknlp.org/2021/01/09/classifierdl_use_spam_en.html)

The easiest way to load these models is to use the `pretrained()` method, which automatically downloads the model and its dependencies (if not already cached) and loads it into your Spark NLP pipeline.

```python
UniversalSentenceEncoder.pretrained("tfhub_use", "en")
ClassifierDLModel.pretrained("classifierdl_use_spam", "en")
```

However, because we will also be working on Midway3, where **compute nodes have no internet access**, we need to download the models to a directory from a login node and load them from there.


In [4]:
### Select Model
model_name = "classifierdl_use_spam"

The models used in this notebook have already been downloaded to the course directory.

To download them to your personal directory instead, uncomment and run the following cell.

If you do, change the `MODEL_DIR` variable in the next section to point to `~/cache_pretrained/`, the folder where models are downloaded by default.

The S3 URIs of the models can usually be found on their model pages.


In [None]:
# from sparknlp.pretrained import ResourceDownloader

# # S3 URIs of the models used in this pipeline
# MODELS = [
#     "s3://auxdata.johnsnowlabs.com/public/models/tfhub_use_en_2.4.0_2.4_1587136330099.zip",            # Universal Sentence Encoder
#     "s3://auxdata.johnsnowlabs.com/public/models/classifierdl_use_spam_en_2.7.1_2.4_1610187019592.zip" # SMS spam/ham classifier
# ]

# # Download loop
# for model_path in MODELS:
#     print(f"Downloading {model_path} …")
#     ResourceDownloader.downloadModelDirectly(model_path)

# print("All requested models are now cached locally")

In [None]:
# Run this if you are on Midway
# Set the model directory to where the models are saved
MODEL_DIR = "../models/"

use_path = os.path.join(MODEL_DIR, "tfhub_use_en_2.4.0_2.4_1587136330099")
classifier_path = os.path.join(
    MODEL_DIR, "classifierdl_use_spam_en_2.7.1_2.4_1610187019592"
)
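
Before loading models offline, it can be useful to confirm that the unzipped model directories exist. The sketch below assumes, as the paths above suggest, that each local directory is named after the S3 zip file without its `.zip` extension. `MODEL_URIS`, `local_model_dir`, and `missing_models` are hypothetical names used for illustration only.

```python
from pathlib import Path

# S3 URIs taken from the download cell above
MODEL_URIS = [
    "s3://auxdata.johnsnowlabs.com/public/models/tfhub_use_en_2.4.0_2.4_1587136330099.zip",
    "s3://auxdata.johnsnowlabs.com/public/models/classifierdl_use_spam_en_2.7.1_2.4_1610187019592.zip",
]


def local_model_dir(uri: str) -> str:
    """Directory name assumed for a model: the zip file name without '.zip'."""
    name = uri.rsplit("/", 1)[-1]
    return name[: -len(".zip")] if name.endswith(".zip") else name


def missing_models(model_dir, uris=MODEL_URIS):
    """Return the expected model directory names that are not found under model_dir."""
    root = Path(model_dir).expanduser()
    return [
        local_model_dir(uri)
        for uri in uris
        if not (root / local_model_dir(uri)).is_dir()
    ]
```

For example, `missing_models("../models/")` should return an empty list when both models are present in the course directory.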

## 4. Example Data


We’ll use ten hand-labeled SMS messages to check the spam classifier. The expected label is given in the comment after each message.


In [9]:
text_list = [
    "Hiya do u like the hlday pics looked horrible in them so took mo out! Hows the camp Amrca thing? Speak soon Serena:)",  # HAM
    "U have a secret admirer who is looking 2 make contact with U-find out who they R*reveal who thinks UR so special-call on 09058094594",  # SPAM
    "Hey, just checking in. How was the exam? Let me know when you're free to catch up.",  # HAM
    "Congratulations! You've won a £1000 Tesco gift card. To claim, text WIN to 80062. Hurry, offer ends soon!",  # SPAM
    "Dinner's at 7. Don't be late again 😄 Mum's making lasagna!",  # HAM
    "You’ve been selected for a guaranteed cash prize of £2000! To claim, call 09061701461 NOW!",  # SPAM
    "Got to the hotel safely. Weather’s great. Wish you were here!",  # HAM
    "FreeMsg: CLAIM YOUR FREE RINGTONE NOW! Just text the word ‘TONE’ to 87131. Don’t miss out!",  # SPAM
    "Happy birthday! 🎉 Hope today is full of good vibes and cake.",  # HAM
    "URGENT! Your mobile number has won £5000 cash. Call 09066362231 to claim your prize. T&C apply.",  # SPAM
]

## 5. Define Spark NLP pipeline


It’s a common practice in **PySpark** (which we will touch on in wk1.3-netflix-challenge) and **Spark NLP** to build reusable, modular pipelines using `Pipeline` and its components.

For more information on building custom pipelines in Spark NLP:
[Custom Pipelines Guide](https://sparknlp.org/api/python/user_guide/custom_pipelines.html)

Below are the key components used in this SMS spam classification example:

| Class                                         | Description                                                                                                                                                                                   |
| --------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `sparknlp.base.DocumentAssembler`             | The entry point for any Spark NLP pipeline. It transforms raw text into `document` type, which downstream annotators can consume. All NLP pipelines begin with this step.                     |
| `sparknlp.annotator.UniversalSentenceEncoder` | A Spark NLP wrapper for TensorFlow Hub’s Universal Sentence Encoder (USE). Converts text into dense vector embeddings suitable for classification and semantic similarity.                    |
| `sparknlp.annotator.ClassifierDLModel`        | A deep learning-based text classifier. Consumes sentence/document embeddings and predicts a class label. Trained on top of embeddings like USE, BERT, etc.                                    |
| `pyspark.ml.Pipeline`                         | A core Spark ML class used to chain multiple transformers and estimators into a single pipeline. Used for both training (fit) and inference (transform). Enables a clean and scalable design. |


In [6]:
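import urllib.request


def has_internet(timeout: int = 3) -> bool:
    """Return True if a small HTTP request succeeds within `timeout` seconds."""
    try:
        with urllib.request.urlopen(
            "https://clients3.google.com/generate_204", timeout=timeout
        ):
            return True
    except Exception:
        # Any failure (DNS error, timeout, blocked network) is treated as offline
        return False


# Used below to decide whether to download the models or load them from local paths
online_ok = has_internet()
print(f"Internet available: {online_ok}")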

Internet available: True


In [7]:
# Convert the raw `text` column into Spark NLP `document` annotations
documentAssembler = DocumentAssembler().setInputCol("text").setOutputCol("document")

# Load models: download them if online; otherwise load from the local paths under MODEL_DIR
if online_ok:
    use = UniversalSentenceEncoder.pretrained("tfhub_use", "en")
    document_classifier = ClassifierDLModel.pretrained("classifierdl_use_spam", "en")
else:
    use = UniversalSentenceEncoder.load(use_path)
    document_classifier = ClassifierDLModel.load(classifier_path)

# Set input and output columns
use = use.setInputCols(["document"]).setOutputCol("sentence_embeddings")
document_classifier = (
    document_classifier
    .setInputCols(["sentence_embeddings"])
    .setOutputCol("class_")
)

# Build the pipeline
nlpPipeline = Pipeline(stages=[documentAssembler, use, document_classifier])

tfhub_use download started this may take some time.
Approximate size to download 923.7 MB
[OK!]
classifierdl_use_spam download started this may take some time.
Approximate size to download 21.3 MB
[OK!]
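
Each annotator's input columns must name a column produced by an earlier stage (`text` → `document` → `sentence_embeddings` → `class_`). The plain-Python sketch below checks that wiring for the column names used in the cell above. It is a toy illustration with hypothetical names (`STAGES`, `check_wiring`), not how Spark itself validates a pipeline.

```python
# Column wiring copied from the pipeline definition above
STAGES = [
    {"name": "DocumentAssembler", "inputs": ["text"], "output": "document"},
    {"name": "UniversalSentenceEncoder", "inputs": ["document"], "output": "sentence_embeddings"},
    {"name": "ClassifierDLModel", "inputs": ["sentence_embeddings"], "output": "class_"},
]


def check_wiring(stages, source_columns):
    """Return a list of problems where a stage reads a column nobody has produced yet."""
    available = set(source_columns)
    problems = []
    for stage in stages:
        for col in stage["inputs"]:
            if col not in available:
                problems.append(f"{stage['name']} needs missing column '{col}'")
        # The stage's output becomes available to every later stage
        available.add(stage["output"])
    return problems


print(check_wiring(STAGES, ["text"]))  # prints [] when every input is satisfied
```

Dropping or reordering a stage makes the mismatch visible before any model is run, which is the same reason the order of `stages` matters in `Pipeline`.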


## 6. Run the pipeline


In [10]:
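# One row per message, in a single string column named "text"
df = spark.createDataFrame(text_list, StringType()).toDF("text")
# All stages are pretrained, so fit() only assembles the PipelineModel; transform() runs inference
result = nlpPipeline.fit(df).transform(df)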

## 7. Visualize results


In [11]:
# Pair each message's text with its predicted label, one output row per message
result.select(
    F.explode(F.arrays_zip(result.document.result, result.class_.result)).alias("cols")
).select(
    F.expr("cols['0']").alias("document"), F.expr("cols['1']").alias("class")
).show(
    truncate=False
)

+------------------------------------------------------------------------------------------------------------------------------------+-----+
|document                                                                                                                            |class|
+------------------------------------------------------------------------------------------------------------------------------------+-----+
|Hiya do u like the hlday pics looked horrible in them so took mo out! Hows the camp Amrca thing? Speak soon Serena:)                |ham  |
|U have a secret admirer who is looking 2 make contact with U-find out who they R*reveal who thinks UR so special-call on 09058094594|ham  |
|Hey, just checking in. How was the exam? Let me know when you're free to catch up.                                                  |ham  |
|Congratulations! You've won a £1000 Tesco gift card. To claim, text WIN to 80062. Hurry, offer ends soon!                           |spam |
|Dinner's at 

The spam classifier seems to work well on these examples, with one exception: the second message is labeled `ham` but should be `spam`.
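
To put a number on this, predicted labels can be compared against the hand labels from the comments in `text_list`. The sketch below is a minimal illustration: only the four predictions visible in the output above are hardcoded, and `compare_labels` is a hypothetical helper, not part of Spark NLP. In the notebook, the full list of predictions would come from the `class_` column of `result` instead.

```python
# Hand labels, in the same order as text_list
expected = ["ham", "spam", "ham", "spam", "ham", "spam", "ham", "spam", "ham", "spam"]

# Predictions for the first four messages, as shown in the output above
predicted = ["ham", "ham", "ham", "spam"]


def compare_labels(expected, predicted):
    """Return (accuracy, indices of mismatches) over the rows present in both lists."""
    pairs = list(zip(expected, predicted))  # zip stops at the shorter list
    wrong = [i for i, (want, got) in enumerate(pairs) if want != got]
    accuracy = (len(pairs) - len(wrong)) / len(pairs) if pairs else 0.0
    return accuracy, wrong


accuracy, wrong = compare_labels(expected, predicted)
print(f"accuracy={accuracy:.2f}, misclassified indices={wrong}")
# prints: accuracy=0.75, misclassified indices=[1]
```

Index 1 is the "secret admirer" message, the one misclassification noted above.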
