# SMS Spam Filtering

<table align="left">

  <td>
    <a href="https://colab.research.google.com/github/GoogleCloudPlatform/ai-ml-recipes/blob/main/notebooks/classification/multilayer_perceptron_classifier/sms_spam_filtering.ipynb">
      <img src="https://avatars.githubusercontent.com/u/33467679?s=200&v=4" width="32px" alt="Colab logo"> Run in Colab
    </a>
  </td>
  <td>
    <a href="https://github.com/GoogleCloudPlatform/ai-ml-recipes/blob/main/notebooks/classification/multilayer_perceptron_classifier/sms_spam_filtering.ipynb">
      <img src="https://github.githubassets.com/assets/GitHub-Mark-ea2971cee799.png" width="32px" alt="GitHub logo">
      View on GitHub
    </a>
  </td>
  <td>
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/ai-ml-recipes/main/notebooks/classification/multilayer_perceptron_classifier/sms_spam_filtering.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo">
      Open in Vertex AI Workbench
    </a>
  </td>
  <td>
    <a href="https://console.cloud.google.com/bigquery/import?url=https://github.com/GoogleCloudPlatform/ai-ml-recipes/blob/main/notebooks/classification/multilayer_perceptron_classifier/sms_spam_filtering.ipynb">
      <img src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTW1gvOovVlbZAIZylUtf5Iu8-693qS1w5NJw&s" alt="BQ logo" width="35">
      Open in BQ Studio
    </a>
  </td>
  <td>
    <a href="https://console.cloud.google.com/vertex-ai/colab/import/https:%2F%2Fraw.githubusercontent.com%2FGoogleCloudPlatform%2Fai-ml-recipes%2Fmain%2Fnotebooks%2Fclassification%2Fmultilayer_perceptron_classifier%2Fsms_spam_filtering.ipynb">
    <img width="32px" src="https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN" alt="Google Cloud Colab Enterprise logo">
    Open in Colab Enterprise
    </a>
  </td>

</table>

This notebook shows how to classify SMS messages as spam or ham (not spam) using Spark ML's Multilayer Perceptron Classifier.

#### **Steps**
Using Spark, the notebook:
1) Reads the [SMS Spam Collection](https://doi.org/10.24432/C5CC84) dataset from [gs://dataproc-metastore-public-binaries/sms_spam_collection/](https://console.cloud.google.com/storage/browser/dataproc-metastore-public-binaries/sms_spam_collection).
2) Processes the dataset to build features and trains (fits) the classification model to predict the target value.  
   **Features**: text  
   **Target**: spam or ham
3) Evaluates the model and plots the results.

#### **Details of the dataset**

- A collection of 425 SMS spam messages manually extracted from the Grumbletext website, a UK forum where cell phone users make public claims about SMS spam, most of them without quoting the spam message they received. Identifying the text of spam messages in these claims is hard and time-consuming, and it involved carefully scanning hundreds of web pages. The Grumbletext website is: http://www.grumbletext.co.uk/.
- A subset of 3,375 randomly chosen SMS ham messages from the NUS SMS Corpus (NSC), a dataset of about 10,000 legitimate messages collected for research at the Department of Computer Science at the National University of Singapore. The messages largely originate from Singaporeans, mostly students attending the University. They were collected from volunteers who were made aware that their contributions would be made publicly available. The NUS SMS Corpus is available at: http://www.comp.nus.edu.sg/~rpnlpir/downloads/corpora/smsCorpus/.
- A list of 450 SMS ham messages collected from Caroline Tagg's PhD thesis, available at http://etheses.bham.ac.uk/253/1/Tagg09PhD.pdf.

### Setup

#### Identity and Access Management (IAM)

Make sure the service account running this notebook has the required permissions:

- **Run the notebook**
  - AI Platform Notebooks Service Agent
  - Notebooks Admin
  - Vertex AI Administrator
- **Read files from bucket**
  - Storage Object Viewer
- **Run Dataproc jobs**
  - Dataproc Service Agent
  - Dataproc Worker

In [None]:
#### Import dependencies
import itertools
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import confusion_matrix

from pyspark.ml import Pipeline

from pyspark.ml.feature import Tokenizer, StopWordsRemover, HashingTF, IDF, StringIndexer
from pyspark.ml.classification import MultilayerPerceptronClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator

In [None]:
from pyspark.sql import SparkSession

In [None]:
spark = SparkSession.builder \
    .appName("SMS spam filtering with Multilayer Perceptron Classifier") \
    .enableHiveSupport() \
    .getOrCreate()

In [None]:
# Read the CSV files from the public bucket, treating the first row as column names
raw_dataset = spark.read.option("header", True).csv("gs://dataproc-metastore-public-binaries/sms_spam_collection/")

### Exploratory Data Analysis

In [None]:
# Show the count of each class
class_counts = raw_dataset.groupBy('label').count()

# Calculate and display the class distribution
total_count = raw_dataset.count()
class_counts.withColumn('Percentage', (class_counts['count'] / total_count) * 100).show()

|label|count|       Percentage|
|-----|-----|-----------------|
|  ham| 4827|86.59849300322928|
| spam|  747|13.40150699677072|
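
The classes are imbalanced: a model that always predicts ham would already be right about 87% of the time, so plain accuracy is a weak yardstick here. The sketch below reproduces the percentages from the counts in the table and computes that majority-class baseline. The variable names are illustrative and not part of the notebook.

```python
# Class counts copied from the table above
counts = {"ham": 4827, "spam": 747}
total = sum(counts.values())

# Share of each class as a percentage of all messages
percentages = {label: 100 * n / total for label, n in counts.items()}

# Accuracy of a trivial model that always predicts the most frequent class
majority_label = max(counts, key=counts.get)
baseline_accuracy = counts[majority_label] / total

print(percentages)
print(f"Always predicting '{majority_label}' gives accuracy {baseline_accuracy:.3f}")
```

Any trained classifier should be judged against this baseline, which is one reason the evaluation later in the notebook uses area under the ROC curve rather than accuracy.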


### Process dataset to create features

In [None]:
# StringIndexer to convert string labels to numerical labels
label_indexer = StringIndexer(inputCol="label", outputCol="label_index")

In [None]:
# Tokenize the SMS text
tokenizer = Tokenizer(inputCol="text", outputCol="words")

In [None]:
# Remove stop words
remover = StopWordsRemover(inputCol="words", outputCol="filtered_words")

In [None]:
# Hashing TF to convert words to numerical features
hashingTF = HashingTF(inputCol="filtered_words", outputCol="numerical_features", numFeatures=5000)
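
`HashingTF` maps each token to one of `numFeatures` buckets with a hash function and counts how many tokens land in each bucket, so no vocabulary has to be stored. Below is a minimal pure-Python sketch of that idea. It uses `zlib.crc32` as a stand-in hash (Spark uses MurmurHash3), so the bucket indices will not match Spark's output, and `hashing_tf` is a hypothetical helper rather than a Spark API.

```python
import zlib
from collections import Counter


def hashing_tf(tokens, num_features=5000):
    """Return a sparse {bucket_index: count} term-frequency mapping.

    Illustrative only: crc32 stands in for the hash function Spark uses.
    """
    buckets = Counter()
    for token in tokens:
        # Hash the token and fold the hash into the fixed feature range
        index = zlib.crc32(token.encode("utf-8")) % num_features
        buckets[index] += 1
    return dict(buckets)


tokens = ["free", "prize", "call", "free"]
vector = hashing_tf(tokens)
print(vector)  # bucket indices depend on the hash; the counts sum to len(tokens)
```

Because the feature space is fixed, two different words can collide in the same bucket. A larger `numFeatures` lowers the collision rate at the cost of a wider input layer for the classifier.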

In [None]:
# TF-IDF
idf = IDF(inputCol="numerical_features", outputCol="features")
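
The `IDF` stage down-weights features that appear in many messages. Spark's documentation gives the weight as log((m + 1) / (df + 1)), where m is the number of documents and df is the number of documents containing the feature. The sketch below applies that formula to toy data in plain Python; `idf_weights` and `tf_idf` are hypothetical helpers, and the three documents are made up for illustration.

```python
import math


def idf_weights(doc_term_counts, num_features):
    """Compute one IDF weight per feature index as log((m + 1) / (df + 1))."""
    m = len(doc_term_counts)
    doc_freq = [0] * num_features
    for counts in doc_term_counts:
        for index in counts:
            doc_freq[index] += 1
    return [math.log((m + 1) / (df + 1)) for df in doc_freq]


def tf_idf(counts, idf):
    """Scale each term count by the IDF weight of its feature index."""
    return {index: tf * idf[index] for index, tf in counts.items()}


# Three toy documents as sparse {feature_index: term_count} mappings
docs = [{0: 1, 1: 2}, {0: 1}, {0: 3, 2: 1}]
idf = idf_weights(docs, num_features=3)
print(idf)
print(tf_idf(docs[0], idf))
```

Feature 0 occurs in every toy document, so its weight is log(4 / 4) = 0 and it contributes nothing to the final vector, while the rarer features 1 and 2 each get log(2).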

In [None]:
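# Chain the preprocessing stages into a single pipeline
prep_pipeline = Pipeline(stages=[label_indexer, tokenizer, remover, hashingTF, idf])

# Fit the stages that learn from data (StringIndexer, IDF) and apply the whole pipeline
processed_dataset = prep_pipeline.fit(raw_dataset).transform(raw_dataset)

# Keep only the columns the classifier needs
dataset = processed_dataset.select("label_index", "features")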

In [None]:
dataset.show(5)

|label_index|            features|
|-----------|--------------------|
|        0.0|(5000,[98,740,750...|
|        0.0|(5000,[1727,2630,...|
|        1.0|(5000,[581,587,10...|
|        0.0|(5000,[594,862,31...|
|        0.0|(5000,[1197,1515,...|

### Train/Fit the model

In [None]:
# Split the dataset into training and testing sets
(trainingData, testData) = dataset.randomSplit([0.8, 0.2], seed=42)

# Create an MLP classifier
layers = [5000, 100, 50, 2]  # Input: 5000 features, two hidden layers, output: binary (spam or ham)
mlp_classifier = MultilayerPerceptronClassifier(
    labelCol="label_index",
    featuresCol="features",
    layers=layers,
    blockSize=128,
    seed=42)

pipeline = Pipeline(stages=[mlp_classifier])

# Train the MLP classifier
mlp_model = pipeline.fit(trainingData)
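
The `layers` list fixes the size of the network: each pair of consecutive layers is fully connected and has one bias per output unit. As a rough sanity check, the sketch below counts the trainable parameters implied by `[5000, 100, 50, 2]`. It is plain arithmetic and does not call Spark.

```python
layers = [5000, 100, 50, 2]

# Each pair of consecutive layers contributes a weight matrix plus one bias per output unit
params_per_layer = [(n_in + 1) * n_out for n_in, n_out in zip(layers, layers[1:])]
total_params = sum(params_per_layer)

print(params_per_layer)  # [500100, 5050, 102]
print(total_params)      # 505252
```

Almost all of the parameters sit in the first layer, so `numFeatures` in `HashingTF` is the main lever on model size. If it changes, the first entry of `layers` has to change with it.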

### Evaluate the model

In [None]:
# Make predictions on the test set
predictions = mlp_model.transform(testData)

# Evaluate the classifier using BinaryClassificationEvaluator
evaluator = BinaryClassificationEvaluator(
    labelCol="label_index",
    rawPredictionCol="rawPrediction",  # Use raw prediction scores
    metricName="areaUnderROC"  # You can choose other metrics like "areaUnderPR" if needed
)

areaUnderROC = evaluator.evaluate(predictions)

print(f"Area Under ROC: {areaUnderROC}")
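
Area under the ROC curve can be read as the probability that a randomly chosen spam message gets a higher spam score than a randomly chosen ham message. The sketch below computes it directly from that definition on hypothetical scores. It is meant to build intuition and is not how `BinaryClassificationEvaluator` computes the metric internally; `auc_by_pairs` is an illustrative helper.

```python
def auc_by_pairs(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a win. Quadratic in the number of examples,
    so suitable for illustration only.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical labels (1 = spam) and spam scores, not output from this notebook
labels = [0, 0, 1, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]
print(auc_by_pairs(labels, scores))  # 8 of 9 pairs are ordered correctly
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation. Unlike accuracy, this metric is not inflated by the large share of ham messages.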

In [None]:
# Convert the PySpark DataFrame to a Pandas DataFrame for confusion matrix
predictions_pd = predictions.select("label_index", "prediction").toPandas()

# Compute the confusion matrix
confusion = confusion_matrix(predictions_pd["label_index"], predictions_pd["prediction"])

# Visualize the confusion matrix
def plot_confusion_matrix(cm, classes, normalize=False, title="Confusion Matrix", cmap=plt.cm.Blues):
    # Normalize before drawing so the heatmap and the cell labels show the same values
    if normalize:
        cm = cm.astype("float") / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print("Confusion matrix, without normalization")

    print(cm)

    plt.figure(figsize=(8, 6))
    plt.imshow(cm, interpolation="nearest", cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    # Annotate each cell, using white text on dark cells for contrast
    fmt = ".2f" if normalize else "d"
    thresh = cm.max() / 2.0
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt), horizontalalignment="center", color="white" if cm[i, j] > thresh else "black")

    plt.ylabel("True label")
    plt.xlabel("Predicted label")
    plt.tight_layout()

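# StringIndexer orders labels by descending frequency by default,
# so ham (the most frequent label) is index 0.0 and spam is index 1.0
class_names = ["ham", "spam"]
plot_confusion_matrix(confusion, classes=class_names, title="Confusion Matrix")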

plt.show()
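
The confusion matrix can be summarized with precision, recall and F1 for the spam class, which are more informative than accuracy on an imbalanced dataset. The sketch below derives them from a 2x2 matrix laid out like `sklearn`'s `confusion_matrix` output (rows are true labels, columns are predictions, index 0 = ham, index 1 = spam). `spam_metrics` is a hypothetical helper and the counts are made up, not results from this notebook.

```python
def spam_metrics(cm):
    """Return (precision, recall, f1) for the positive (spam) class.

    Expects cm[true][predicted] with index 0 = ham and index 1 = spam.
    """
    tn, fp = cm[0]
    fn, tp = cm[1]
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1


# Hypothetical counts for illustration only
example_cm = [[950, 10], [20, 130]]
precision, recall, f1 = spam_metrics(example_cm)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

In the notebook, the same function could be called as `spam_metrics(confusion)`. Precision tells how many flagged messages were really spam, and recall tells how much of the spam was caught.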


In [None]:
# Stop the Spark session
spark.stop()