# Spam Classifier Word Embeddings

This notebook takes a look at word embeddings. The goal is to:

1. Apply the [TensorFlow tutorial](https://www.tensorflow.org/tutorials/text/word_embeddings) to the Spam/Ham dataset.

2. Load the embeddings into the [Embedding Projector](http://projector.tensorflow.org/).

The notebook also shows how to load and prepare the Spam/Ham data so that these NLP models can be applied to it.

In [152]:
# importing all needed libraries and functions
import io
import os
import re
import shutil
import string
import pandas as pd
import numpy as np
import tensorflow as tf

from sklearn.model_selection import train_test_split
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Embedding, GlobalAveragePooling1D
from tensorflow.keras.layers import TextVectorization


## Loading data

Load the Spam and Ham data with pandas, then split it into a training, a validation and a test set, as usual with scikit-learn's `train_test_split` function.

After splitting, the data needs to be converted into TensorFlow tensors, wrapped in a `tf.data.Dataset`.

In [153]:
# Load spam/ham data
data = pd.read_csv(
    "data/SMSSpamCollection.txt",
    encoding="utf-8",
    header=None,
    delimiter="\t",
    names=["target", "text"],
)

# Encoding target variable
data["target"] = np.where(data["target"] == "spam", 1, 0)


In [154]:
# Splitting data in train, validation and test set
X_train, X_test, y_train, y_test = train_test_split(data["text"], data["target"], random_state=0, test_size=0.2)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, random_state=0, test_size=0.25)
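
The two `test_size` values above are applied one after the other, so the second one refers to what is left after the first split. A small arithmetic sketch of the resulting shares (the variable names are only for illustration):

```python
# Effective fractions produced by two successive train_test_split calls.
test_size = 0.2          # first split: share of all rows held out for testing
val_size_of_rest = 0.25  # second split: share of the remaining rows used for validation

remaining = 1.0 - test_size                   # 0.8 of the data is left after the first split
val_fraction = remaining * val_size_of_rest   # 0.8 * 0.25 = 0.2
train_fraction = remaining - val_fraction     # 0.8 - 0.2 = 0.6

print(f"train={train_fraction:.2f}, val={val_fraction:.2f}, test={test_size:.2f}")
# prints: train=0.60, val=0.20, test=0.20
```

So the data ends up in roughly a 60/20/20 split.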

In [155]:
# Transform the data into tf.data.Dataset objects.
# Use tf.data.Dataset.from_tensor_slices and pass the text and the target as separate arrays.
# one solution is shown here: https://medium.com/when-i-work-data/converting-a-pandas-dataframe-into-a-tensorflow-dataset-752f3783c168

train_ds = tf.data.Dataset.from_tensor_slices((X_train.values, y_train.values))
val_ds = tf.data.Dataset.from_tensor_slices((X_val.values, y_val.values))
test_ds = tf.data.Dataset.from_tensor_slices((X_test.values, y_test.values))


## Text preprocessing

Follow the steps in the tutorial. Our dataset has a vocabulary size (`vocab_size`) of 7546.
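
To make the preprocessing idea concrete, here is a plain-Python sketch of the same standardization (lowercase, strip punctuation) followed by a whitespace split and a word count. It only illustrates how a vocabulary can be derived from text. The helper names and the sample messages are made up, and the real `TextVectorization` layer additionally reserves entries for padding and unknown tokens, so its vocabulary size will not match such a count exactly.

```python
import re
import string
from collections import Counter

# Same character class as the notebook's custom_standardization uses.
PUNCT_PATTERN = re.compile("[%s]" % re.escape(string.punctuation))

def standardize(text):
    """Lowercase the text and remove ASCII punctuation."""
    return PUNCT_PATTERN.sub("", text.lower())

def build_vocabulary(texts):
    """Return the distinct tokens, most frequent first (whitespace split)."""
    counts = Counter()
    for text in texts:
        counts.update(standardize(text).split())
    return [token for token, _ in counts.most_common()]

# Made-up messages, only for illustration.
sample_texts = ["Free entry! Call NOW.", "Are you free now?", "Call me later..."]
vocabulary = build_vocabulary(sample_texts)
print(len(vocabulary), vocabulary)
```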

In [156]:
def custom_standardization(input_data):
    """Lowercase the text and strip punctuation.

    Args:
        input_data (tf.Tensor): Raw text as a tf.string tensor.

    Returns:
        tf.Tensor: The preprocessed text.
    """
    text_lower = tf.strings.lower(input_data)
    return tf.strings.regex_replace(text_lower,
                                  '[%s]' % re.escape(string.punctuation), '')

In [157]:
# Vocabulary size and number of words in a sequence.
vocab_size = 7546
sequence_length = 100

# Use the text vectorization layer to normalize, split, and map strings to
# integers. Note that the layer uses the custom standardization defined above.
# Set output_sequence_length so that every sample is padded or truncated to the
# same length, since the messages differ in length.
vectorize_layer = TextVectorization(
    standardize=custom_standardization,
    max_tokens=vocab_size,
    output_mode='int',
    output_sequence_length=sequence_length)

# Make a text-only dataset (no labels) and call adapt to build the vocabulary.
text_ds = train_ds.map(lambda x, y: x)
vectorize_layer.adapt(text_ds)

## Create a classification model

In the tutorial, the batch size is set when the data is loaded as a `tf.data.Dataset`. Our datasets were built from tensor slices and are still unbatched, so we have to batch them explicitly. This is especially important for training the model.
You can create the batches as shown here: [tf.data.Dataset.batch() method, combined with repeat() method](https://www.gcptutorials.com/article/how-to-use-batch-method-in-tensorflow).
Specify the `batch_size`, for example 32.
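
As a rough illustration of why the number of steps matters once a dataset is repeated, the following plain-Python sketch imitates `repeat()` followed by `batch()` using `itertools`. The sample list, the dataset size of 1000 and the helper `steps_for_one_pass` are made up for illustration; they are not part of the tutorial or the TensorFlow API.

```python
import math
from itertools import cycle, islice

def steps_for_one_pass(num_samples, batch_size):
    """Number of batches needed to see every sample once."""
    return math.ceil(num_samples / batch_size)

# Hypothetical dataset size, not the real Spam/Ham count.
print(steps_for_one_pass(1000, 32))  # prints: 32

# A repeated dataset is an endless stream, so batches simply continue
# across the point where the data starts over.
samples = list(range(10))
repeated = cycle(samples)  # plays the role of Dataset.repeat()
first_batches = [list(islice(repeated, 4)) for _ in range(3)]  # batch size 4
print(first_batches)  # prints: [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 0, 1]]
```

Because the repeated stream never ends, training and evaluation need an explicit step count to know when an epoch or an evaluation run is over.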

In [158]:
# create batched datasets for training, validation and test datasets.
dataset_train_batch = train_ds.repeat().batch(batch_size=32)
dataset_val_batch = val_ds.repeat().batch(batch_size=32)
dataset_test_batch = test_ds.repeat().batch(batch_size=32)

In [159]:
# Model structure: vectorize -> embed -> average over tokens -> dense layers
embedding_dim = 16

model = Sequential([
  vectorize_layer,
  Embedding(vocab_size, embedding_dim, name="embedding"),
  GlobalAveragePooling1D(),
  Dense(16, activation='relu'),
  Dense(16, activation='relu'),
  Dense(1)
])

In [160]:
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir="logs")

In [161]:
# model compiling using Adam optimizer and BinaryCrossentropy loss
model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=['accuracy'])
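
The last `Dense(1)` layer has no activation, so the model outputs raw logits, which is why the loss is created with `from_logits=True`. A minimal plain-Python sketch of that relationship, assuming the textbook definitions of the sigmoid and of binary cross-entropy (the helper names are hypothetical, and library implementations typically use a numerically more stable formulation):

```python
import math

def sigmoid(logit):
    """Map a raw model output (logit) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-logit))

def binary_cross_entropy_from_logit(label, logit):
    """Cross-entropy for a single example, computed from the logit."""
    p = sigmoid(logit)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))

for logit in (-4.0, 0.0, 4.0):
    loss_if_spam = binary_cross_entropy_from_logit(1, logit)
    print(f"logit={logit:+.1f} -> p(spam)={sigmoid(logit):.3f}, loss if spam={loss_if_spam:.3f}")
```

A logit of 0 corresponds to a probability of 0.5, so thresholding the logit at 0 is equivalent to thresholding the probability at 0.5.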

In [162]:
# training the model
model.fit(
    dataset_train_batch,
    validation_data=dataset_val_batch,
    epochs=10,
    steps_per_epoch=100,
    validation_steps=20,
    callbacks=[tensorboard_callback]
    )

Epoch 1/10
100/100 ━━━━━━━━━━━━━━━━━━━━ 3s 4ms/step - accuracy: 0.8672 - loss: 0.3852 - val_accuracy: 0.8828 - val_loss: 0.3343
Epoch 2/10
100/100 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.8669 - loss: 0.3644 - val_accuracy: 0.8828 - val_loss: 0.3280
Epoch 3/10
100/100 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.8656 - loss: 0.3585 - val_accuracy: 0.8828 - val_loss: 0.3186
Epoch 4/10
100/100 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.8666 - loss: 0.3429 - val_accuracy: 0.8828 - val_loss: 0.3011
Epoch 5/10
100/100 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.8672 - loss: 0.3073 - val_accuracy: 0.8828 - val_loss: 0.2446
Epoch 6/10
100/100 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.8841 - loss: 0.2008 - val_accuracy: 0.9516 - val_loss: 0.1263
Epoch 7/10
100/100

<keras.src.callbacks.history.History at 0x1e158cb9410>

In [163]:
# calculating the loss and accuracy on the test set.
loss, accuracy = model.evaluate(dataset_test_batch, verbose=2, steps=25)
print(f'Model accuracy: {accuracy}')

25/25 - 0s - 1ms/step - accuracy: 0.9850 - loss: 0.0607
Model accuracy: 0.9850000143051147


In [164]:
# retrieving information summary about the model
model.summary()

## Retrieve the trained word embeddings and save them to disk

Follow the instructions in the tutorial.


In [55]:
# Get the trained embedding matrix and the vocabulary learned by the vectorization layer
weights = model.get_layer('embedding').get_weights()[0]
vocab = vectorize_layer.get_vocabulary()

In [56]:
# Write one embedding vector per line to vectors.tsv and the matching word to metadata.tsv.
with io.open('vectors.tsv', 'w', encoding='utf-8') as out_v, \
     io.open('metadata.tsv', 'w', encoding='utf-8') as out_m:
    for index, word in enumerate(vocab):
        if index == 0:
            continue  # skip 0, it's padding.
        vec = weights[index]
        out_v.write('\t'.join([str(x) for x in vec]) + "\n")
        out_m.write(word + "\n")

## Visualize the embeddings
To visualize the embeddings, upload them to the embedding projector.

Open the [Embedding Projector](http://projector.tensorflow.org/) (this can also run in a local TensorBoard instance).

Click on "Load data".

Upload the two files you created above: `vectors.tsv` and `metadata.tsv`.
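
Before uploading, it can help to check that the two exported files line up: one label per vector, and every vector of the same length. The following is a hedged sketch of such a check. The helper `check_projector_files` is hypothetical, and the tiny files it is run on are made up so that the example works without the trained model.

```python
import os
import tempfile

def check_projector_files(vectors_path, metadata_path):
    """Return (rows, dims) if both files are consistent, else raise ValueError."""
    with open(vectors_path, encoding="utf-8") as f:
        vector_rows = [line.rstrip("\n").split("\t") for line in f]
    with open(metadata_path, encoding="utf-8") as f:
        labels = [line.rstrip("\n") for line in f]

    if len(vector_rows) != len(labels):
        raise ValueError(f"{len(vector_rows)} vectors but {len(labels)} labels")
    dims = {len(row) for row in vector_rows}
    if len(dims) != 1:
        raise ValueError(f"inconsistent vector lengths: {sorted(dims)}")
    return len(vector_rows), dims.pop()

# Tiny made-up files, so the sketch runs without the trained model.
with tempfile.TemporaryDirectory() as tmp:
    vectors_path = os.path.join(tmp, "vectors.tsv")
    metadata_path = os.path.join(tmp, "metadata.tsv")
    with open(vectors_path, "w", encoding="utf-8") as f:
        f.write("0.1\t0.2\n0.3\t0.4\n")
    with open(metadata_path, "w", encoding="utf-8") as f:
        f.write("free\ncall\n")
    shape = check_projector_files(vectors_path, metadata_path)

print(shape)  # prints: (2, 2)
```

On the files written by the notebook, the second number should equal `embedding_dim`.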

In [165]:
texts = [
    "Hi, are we still meeting tomorrow?",
    "call txt text uk free claim stop www reply 150p",
    "how are you? will we claim tickets free now?",
    "today is a good weather to rub a bank in Germany! Will you come with me?",
    "This is your chance to get a free vacation on Bahamas! Call or text me right now and for 150p you will get full details!",
    "Hey, just wanted to check in and see how you're doing.",
]

inputs = tf.data.Dataset.from_tensor_slices(texts).batch(1)

logits = model.predict(inputs)
probabilities = tf.sigmoid(logits).numpy().flatten()

for text, p in zip(texts, probabilities):
    label = "SPAM" if p >= 0.5 else "NOT SPAM"
    print(f"{label} ({p:.5f}) → {text}")

6/6 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step
NOT SPAM (0.00384) → Hi, are we still meeting tomorrow?
SPAM (0.97693) → call txt text uk free claim stop www reply 150p
NOT SPAM (0.18955) → how are you? will we claim tickets free now?
NOT SPAM (0.02598) → today is a good weather to rub a bank in Germany! Will you come with me?
SPAM (0.99295) → This is your chance to get a free vacation on Bahamas! Call or text me right now and for 150p you will get full details!
NOT SPAM (0.00686) → Hey, just wanted to check in and see how you're doing.
