# Spam Classifier Word Embeddings

Let's have a look at word embeddings. Therefore, we want to 

1. Apply the [TensorFlow tutorial](https://www.tensorflow.org/tutorials/text/word_embeddings) to the Spam/Ham dataset.

2. Load the embedings in the [Embedding Projector](http://projector.tensorflow.org/).

In this notebook you will find some help how to load and prepare the Spam/Ham data to apply those NLP models.

In [113]:
# importing all needed libraries and functions
import io
import os
import re
import shutil
import string
import tensorflow as tf

from sklearn.model_selection import train_test_split
from tensorflow import keras
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Embedding, GlobalAveragePooling1D
from tensorflow.keras.layers import TextVectorization
from tensorflow.keras.layers import Dropout
import pandas as pd
import numpy as np

from transformers import pipeline

## Loading data

Load the Spam and Ham data with pandas and afterwards split the data into a train, a validation and a test set as usual with sklearns train_test_split function.

After spliting the data, we need to transform the data to tensorflow "tensors".

In [87]:
# Load spam/ham data
data = pd.read_csv(
    "data/SMSSpamCollection.txt",
    encoding="utf-8",
    header=None,
    delimiter="\t",
    names=["target", "text"],
)
# Encoding target variable
data["target"] = np.where(data["target"] == "spam", 1, 0)


In [88]:
# Splitting data in train, validation and test set
X_train, X_test, y_train, y_test = train_test_split(
    data["text"], data["target"], random_state=0, test_size=0.2
)

X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, random_state=0, test_size=0.25
)




In [89]:
# transform data to tf.dataset
train_ds = tf.data.Dataset.from_tensor_slices((X_train, y_train))
val_ds = tf.data.Dataset.from_tensor_slices((X_val, y_val))
test_ds = tf.data.Dataset.from_tensor_slices((X_test, y_test))
# try it with tf.data.Dataset.from_tensor_slices, where you specify the text and the target separately.
# one solution is shown here: https://medium.com/when-i-work-data/converting-a-pandas-dataframe-into-a-tensorflow-dataset-752f3783c168

## Text preprocessing

follow the steps in the tutorial, our dataset has a vocab_size of 7546.

In [90]:
def custom_standardization(input_data):
    """Text preprocessing: lowercases, no punctuation

    Args:
        input_data (tf.dataframe): [text, formated as tf.string]

    Returns:
        [tf.dataframe]: [preprocessed text]
    """
    text_lower = tf.strings.lower(input_data)
    return tf.strings.regex_replace(text_lower,
                                  '[%s]' % re.escape(string.punctuation), '')

In [91]:
# Vocabulary size and number of words in a sequence.
vocab_size = 7546  # taken from notebook 1
sequence_length = 100

# Use the text vectorization layer to normalize, split, and map strings to 
# integers. Note that the layer uses the custom_standardization function defined above. 
# Set maximum_sequence length as all samples are not of the same length.
vectorize_layer = TextVectorization(
    standardize=custom_standardization,
    max_tokens=vocab_size,
    output_mode='int',
    output_sequence_length=sequence_length)

# Make a text-only dataset (without labels), then call adapt
# to build the vocabulary.
train_text = train_ds.map(lambda x, y: x)
vectorize_layer.adapt(train_text)

## Create a classification model

In the tutorial the batch-size was defined when loading the data as tf.Dataset. That's why we have to specify this now too. This is especially important for training the model.
You can create the batches as shown here: [tf.data.Dataset.batch() method, combined with repeat() method](https://www.gcptutorials.com/article/how-to-use-batch-method-in-tensorflow).
Specify the `batch_size`, for example take 32.

In [92]:
# create batched datasets for training, validation and test datasets.
dataset_train_batch = train_ds.repeat().batch(batch_size=32)
dataset_val_batch = val_ds.repeat().batch(batch_size=32)
dataset_test_batch = test_ds.repeat().batch(batch_size=32)

# checking shape and type of batched dataset 
dataset_train_batch

<_BatchDataset element_spec=(TensorSpec(shape=(None,), dtype=tf.string, name=None), TensorSpec(shape=(None,), dtype=tf.int32, name=None))>

In [114]:
# model structure
embedding_dim=128
# model = Sequential([
#   vectorize_layer,
#   Embedding(vocab_size, embedding_dim, name="embedding"),
#   GlobalAveragePooling1D(),
#   Dense(32, activation='relu'),
#   Dense(32, activation='relu'),
#   Dense(32, activation='relu'),
#   Dense(32, activation='relu'),
#   Dense(1, activation='sigmoid')
# ])

model = Sequential([
    vectorize_layer,
    Embedding(vocab_size, embedding_dim, name="embedding"),
    GlobalAveragePooling1D(),
    Dense(64, activation='relu'),
    Dropout(0.5),
    Dense(32, activation='relu'),
    Dropout(0.3),
    Dense(16, activation='relu'),
    Dense(1, activation='sigmoid')  # WICHTIG: sigmoid activation für Binary Classification!
])

In [115]:
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir="logs")

In [117]:
# model compiling using Adam optimizer and BinaryCrossentropy loss
model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=['accuracy'])

In [118]:
early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor='val_loss',
    patience=3,
    restore_best_weights=True
)

In [119]:
# training the model
model.fit(
    dataset_train_batch,
    validation_data=dataset_val_batch,
    epochs=100,
    steps_per_epoch=100,
    validation_steps=25,
    callbacks=[tensorboard_callback, early_stopping]
    )

Epoch 1/100


  output, from_logits = _get_logits(


[1m100/100[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 6ms/step - accuracy: 0.8569 - loss: 0.4309 - val_accuracy: 0.8750 - val_loss: 0.3521
Epoch 2/100
[1m100/100[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 5ms/step - accuracy: 0.8669 - loss: 0.3786 - val_accuracy: 0.8750 - val_loss: 0.3579
Epoch 3/100
[1m100/100[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 5ms/step - accuracy: 0.8656 - loss: 0.3498 - val_accuracy: 0.8750 - val_loss: 0.2931
Epoch 4/100
[1m100/100[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 5ms/step - accuracy: 0.8672 - loss: 0.2392 - val_accuracy: 0.9375 - val_loss: 0.1545
Epoch 5/100
[1m100/100[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 5ms/step - accuracy: 0.9488 - loss: 0.1519 - val_accuracy: 0.9062 - val_loss: 0.2171
Epoch 6/100
[1m100/100[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 5ms/step - accuracy: 0.9766 - loss: 0.0908 - val_accuracy: 0.9775 - val_loss: 0.0756
Epoch 7/100
[1m100/100[0m [32m━

<keras.src.callbacks.history.History at 0x1e0868e3e50>

In [120]:
# calculating the loss and accuracy on the test set.
loss, accuracy = model.evaluate(dataset_test_batch, verbose=2, steps=25)
print(f'Model accuracy: {accuracy}')

25/25 - 0s - 2ms/step - accuracy: 0.9825 - loss: 0.0643
Model accuracy: 0.9825000166893005


In [121]:
# retrieving information summary about the model
model.summary()

## Retrieve the trained word embeddings and save them to disk

Follow the instructions in the tutorial.


In [122]:
#your code
weights = model.get_layer('embedding').get_weights()[0]
vocab = vectorize_layer.get_vocabulary()

In [111]:
out_v = io.open('vectors.tsv', 'w', encoding='utf-8')
out_m = io.open('metadata.tsv', 'w', encoding='utf-8')

for index, word in enumerate(vocab):
    if index == 0:
        continue  # skip 0, it's padding.
    vec = weights[index]
    out_v.write('\t'.join([str(x) for x in vec]) + "\n")
    out_m.write(word + "\n")
out_v.close()
out_m.close()

## Visualize the embeddings
To visualize the embeddings, upload them to the embedding projector.

Open the [Embedding Projector](http://projector.tensorflow.org/) (this can also run in a local TensorBoard instance).

Click on "Load data".

Upload the two files you created above: vecs.tsv and meta.tsv.

In [123]:
# Statt pipeline, nutze dein trainiertes Modell direkt
def predict_spam(text, model, vectorize_layer):
    """
    Predict if text is spam or ham using your trained model
    
    Args:
        text: string or list of strings
        model: your trained keras model
        vectorize_layer: the vectorization layer from training
    
    Returns:
        predictions and probabilities
    """
    # Convert to tensor if single string
    if isinstance(text, str):
        text = [text]
    
    # Create dataset
    pred_ds = tf.data.Dataset.from_tensor_slices(text).batch(1)
    
    # Get predictions
    predictions = model.predict(pred_ds)
    
    # Convert logits to probabilities
    probabilities = tf.nn.sigmoid(predictions).numpy()
    
    # Convert to labels
    labels = ['ham' if p < 0.5 else 'spam' for p in probabilities]
    
    return labels, probabilities



In [124]:
# Beispiel Nutzung:
test_texts = [
    "Congratulations! You won a free iPhone! Click here now!",
    "Hey, are we still meeting for coffee tomorrow?",
    "URGENT: Your account will be closed.",
    "congratulations you have won free tickets to Bahamas visit www free stuff uk.",
    "Hey can you help me solving a problem with my code?",
    "call txt text uk free claim stop www reply 150p",
    "Hi, you don't believe it. i got a call from the uk and i won a lottery! i only hat to reply 150p to a txt message!"
]

labels, probs = predict_spam(test_texts, model, vectorize_layer)

for text, label, prob in zip(test_texts, labels, probs):
    print(f"Text: {text[:50]}...")
    print(f"Prediction: {label} (probability: {prob[0]:.4f})")
    print("-" * 50)

[1m7/7[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 1ms/step 
Text: Congratulations! You won a free iPhone! Click here...
Prediction: spam (probability: 0.6752)
--------------------------------------------------
Text: Hey, are we still meeting for coffee tomorrow?...
Prediction: spam (probability: 0.5015)
--------------------------------------------------
Text: URGENT: Your account will be closed....
Prediction: spam (probability: 0.5440)
--------------------------------------------------
Text: congratulations you have won free tickets to Baham...
Prediction: spam (probability: 0.7150)
--------------------------------------------------
Text: Hey can you help me solving a problem with my code...
Prediction: spam (probability: 0.5021)
--------------------------------------------------
Text: call txt text uk free claim stop www reply 150p...
Prediction: spam (probability: 0.7240)
--------------------------------------------------
Text: Hi, you don't believe it. i got a call from