## Spam Detection Project Using TensorFlow

This project walks through three stages:

1. Data cleaning and preprocessing (including tokenization).
2. Text vectorization with TensorFlow's `TextVectorization` layer.
3. Building, training, and evaluating a simple neural network model for spam detection.

#### Project Setup
**Prerequisite: Install TensorFlow**

Make sure TensorFlow is installed:

```bash
pip install tensorflow
```

In [1]:
import tensorflow as tf

#### Sample Dataset (Mock Data)

In [2]:
texts = tf.constant([
    "Congratulations! You've won a free ticket to the Bahamas. Call now!",
    "Hey, could you meet me at the coffee shop tomorrow?",
    "Limited time offer! Get 50% off on all items. Buy now!",
    "Are you free to join our team meeting later today?",
    "Important notice: Your account has been compromised. Verify now!",
    "Let's catch up soon. How about lunch next week?"
])

# Labels (1 for spam, 0 for not spam)
labels = tf.constant([1, 0, 1, 0, 1, 0])

#### Step 1: Data Cleaning and Preprocessing

In [5]:
# Clean text function: lowercasing and removing punctuation
def clean_text(text):
    text = tf.strings.lower(text)
    text = tf.strings.regex_replace(text, r'[^\w\s]', '')  # Remove punctuation
    return text

In [6]:
# Apply cleaning function
texts = clean_text(texts)
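
To make the effect of `clean_text` concrete, here is a minimal plain-Python sketch of the same two steps using the standard `re` module. The helper name `clean_text_py` is hypothetical, and Python's `\w` is Unicode-aware while TensorFlow's regex engine may treat it differently, so the two should only be assumed to agree on ASCII text like the sample messages.

```python
import re

def clean_text_py(text: str) -> str:
    """Plain-Python counterpart of clean_text: lowercase, then drop punctuation."""
    text = text.lower()
    # Remove every character that is neither a word character nor whitespace.
    return re.sub(r"[^\w\s]", "", text)

sample = "Limited time offer! Get 50% off on all items. Buy now!"
cleaned = clean_text_py(sample)
print(cleaned)  # limited time offer get 50 off on all items buy now
```

Note that apostrophes are removed too, so "You've" becomes "youve" and "Let's" becomes "lets".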

#### Step 2: Text Vectorization

In [8]:
# Define stop words manually (optional)
stop_words = ["a", "the", "is", "to", "and", "in", "of", "on", "for", "now"]

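# Custom standardization: lowercase, strip punctuation, then remove stop words.
# This repeats the work of clean_text, so the layer also accepts raw, uncleaned text.
def custom_standardization(input_text):
    lowercase_text = tf.strings.lower(input_text)
    cleaned_text = tf.strings.regex_replace(lowercase_text, r'[^\w\s]', '')  # Remove punctuation
    # Delete each stop word only where it appears as a whole word (\b = word boundary)
    for word in stop_words:
        cleaned_text = tf.strings.regex_replace(cleaned_text, r'\b' + word + r'\b', '')
    return cleaned_text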
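
As a rough illustration of what this standardization produces, the sketch below mirrors it in plain Python and then splits on whitespace, which is the default splitting behavior of `TextVectorization`. The names `STOP_WORD_RE` and `standardize_and_split` are hypothetical, and the single alternation pattern is a variation on the per-word loop above, not the notebook's own code.

```python
import re

STOP_WORDS = ["a", "the", "is", "to", "and", "in", "of", "on", "for", "now"]
# One alternation pattern instead of one regex pass per stop word.
STOP_WORD_RE = re.compile(r"\b(?:" + "|".join(map(re.escape, STOP_WORDS)) + r")\b")

def standardize_and_split(text: str) -> list:
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)   # strip punctuation
    text = STOP_WORD_RE.sub("", text)     # drop whole-word stop words
    return text.split()                   # whitespace split collapses leftover gaps

tokens = standardize_and_split(
    "Congratulations! You've won a free ticket to the Bahamas. Call now!"
)
print(tokens)
# ['congratulations', 'youve', 'won', 'free', 'ticket', 'bahamas', 'call']
```

The word boundaries matter: "on" is removed as a standalone word, but "won" and "congratulations" are left intact.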

In [9]:
# Configure TextVectorization layer
max_tokens = 1000  # Maximum vocabulary size (includes the padding and out-of-vocabulary entries)
output_sequence_length = 20  # Pad or truncate every message to exactly 20 token IDs

vectorize_layer = tf.keras.layers.TextVectorization(
    max_tokens=max_tokens,
    output_mode='int',
    output_sequence_length=output_sequence_length,
    standardize=custom_standardization
)

In [10]:
# Adapt the vectorization layer to our texts
vectorize_layer.adapt(texts)

# Vectorize the texts
vectorized_texts = vectorize_layer(texts)


In [11]:
print("Vectorized Texts:", vectorized_texts)

Vectorized Texts: tf.Tensor(
[[37  4  6  3 13 44 41  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [33 36  2 23 24 45 39 16 10  0  0  0  0  0  0  0  0  0  0  0]
 [26 12 18 35 50 19 47 30 42  0  0  0  0  0  0  0  0  0  0  0]
 [46  2  3 29 17 14 22 28 11  0  0  0  0  0  0  0  0  0  0  0]
 [31 20  5 48 34 43 38  8  0  0  0  0  0  0  0  0  0  0  0  0]
 [27 40  9 15 32 49 25 21  7  0  0  0  0  0  0  0  0  0  0  0]], shape=(6, 20), dtype=int64)
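
In the output above, ID 0 is padding, and the same word always maps to the same ID: 2 shows up in the two messages containing "you", and 3 in the two containing "free". The following plain-Python sketch shows one way such a frequency-ranked encoding can be built. It is an illustration under assumptions, not the layer's implementation: `build_vocab` and `encode` are hypothetical helpers, the toy corpus is made up, and the order of equally frequent tokens may differ from what TensorFlow produces.

```python
from collections import Counter

def build_vocab(token_lists, max_tokens=1000):
    counts = Counter(tok for toks in token_lists for tok in toks)
    # Index 0 is reserved for padding and index 1 for out-of-vocabulary tokens.
    vocab = ["", "[UNK]"] + [tok for tok, _ in counts.most_common(max_tokens - 2)]
    return {tok: idx for idx, tok in enumerate(vocab)}

def encode(tokens, vocab, length=20):
    ids = [vocab.get(tok, 1) for tok in tokens][:length]  # unknown tokens -> 1
    return ids + [0] * (length - len(ids))                # right-pad with 0

# Toy corpus (assumed for illustration), already standardized and split.
corpus = [
    ["you", "free", "ticket"],
    ["are", "you", "free"],
    ["call", "me"],
]
vocab = build_vocab(corpus)
row = encode(["you", "won", "free"], vocab, length=6)
print(row)  # [2, 1, 3, 0, 0, 0]
```

Here "won" never appeared in the toy corpus, so it is encoded as the out-of-vocabulary ID 1.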


#### Step 3: Building the Model

In [13]:
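# Define a Sequential model: embedding -> average pooling -> small dense classifier.
# Note: `input_length` belongs to the Keras 2 API used here; Keras 3 deprecates it and it can be dropped.
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=max_tokens, output_dim=16, input_length=output_sequence_length),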
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid")
])

# Compile the model
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])



In [14]:
# Print model summary
model.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     (None, 20, 16)            16000     
                                                                 
 global_average_pooling1d_1  (None, 16)                0         
  (GlobalAveragePooling1D)                                       
                                                                 
 dense_2 (Dense)             (None, 16)                272       
                                                                 
 dense_3 (Dense)             (None, 1)                 17        
                                                                 
Total params: 16289 (63.63 KB)
Trainable params: 16289 (63.63 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
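
The parameter counts in the summary can be checked by hand. The short sketch below redoes the arithmetic in plain Python; the variable names are illustrative, and the size estimate assumes 4-byte float32 weights.

```python
vocab_size = 1000     # max_tokens
embedding_dim = 16    # output_dim of the Embedding layer
hidden_units = 16     # units in the first Dense layer

embedding_params = vocab_size * embedding_dim                # one 16-d vector per token ID
pooling_params = 0                                           # averaging has no weights
hidden_params = embedding_dim * hidden_units + hidden_units  # weights + biases
output_params = hidden_units * 1 + 1                         # weights + bias

total_params = embedding_params + pooling_params + hidden_params + output_params
size_kb = total_params * 4 / 1024  # assuming float32 (4 bytes per parameter)

print(embedding_params, hidden_params, output_params, total_params)  # 16000 272 17 16289
print(round(size_kb, 2))  # 63.63
```

Almost all of the parameters sit in the embedding table, which is sized by `max_tokens` rather than by the roughly 50 distinct tokens in the sample data.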


#### Step 4: Training the Model

In [15]:
# Train the model on the vectorized texts
history = model.fit(vectorized_texts, labels, epochs=10, verbose=2)

Epoch 1/10
1/1 - 3s - loss: 0.6927 - accuracy: 0.5000 - 3s/epoch - 3s/step
Epoch 2/10
1/1 - 0s - loss: 0.6919 - accuracy: 0.5000 - 14ms/epoch - 14ms/step
Epoch 3/10
1/1 - 0s - loss: 0.6912 - accuracy: 0.5000 - 16ms/epoch - 16ms/step
Epoch 4/10
1/1 - 0s - loss: 0.6905 - accuracy: 0.5000 - 18ms/epoch - 18ms/step
Epoch 5/10
1/1 - 0s - loss: 0.6898 - accuracy: 0.5000 - 17ms/epoch - 17ms/step
Epoch 6/10
1/1 - 0s - loss: 0.6891 - accuracy: 0.5000 - 17ms/epoch - 17ms/step
Epoch 7/10
1/1 - 0s - loss: 0.6883 - accuracy: 0.5000 - 9ms/epoch - 9ms/step
Epoch 8/10
1/1 - 0s - loss: 0.6876 - accuracy: 0.5000 - 13ms/epoch - 13ms/step
Epoch 9/10
1/1 - 0s - loss: 0.6868 - accuracy: 0.6667 - 17ms/epoch - 17ms/step
Epoch 10/10
1/1 - 0s - loss: 0.6860 - accuracy: 0.6667 - 15ms/epoch - 15ms/step
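
The first-epoch loss of 0.6927 is close to ln 2 ≈ 0.6931, which is the binary cross-entropy of a model that outputs 0.5 for every example. The sketch below computes that baseline in plain Python; `binary_crossentropy` here is a hand-written helper for illustration, not the Keras function, and the clipping constant is an assumption.

```python
import math

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy over paired labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

labels_py = [1, 0, 1, 0, 1, 0]               # same labels as the sample dataset
chance_loss = binary_crossentropy(labels_py, [0.5] * 6)
print(round(chance_loss, 4))  # 0.6931
```

A loss that only drops from 0.6927 to 0.6860 over ten epochs therefore means the model has barely moved away from guessing.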


#### Step 5: Evaluating the Model

In [16]:
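# Evaluate the model on the same six messages it was trained on.
# There is no held-out set here, so this accuracy says nothing about generalization.
loss, accuracy = model.evaluate(vectorized_texts, labels)
print("Model accuracy:", accuracy)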

# Test the model with new samples
new_texts = tf.constant([
    "Exclusive offer! Act fast to claim your free reward.",
    "Hi, just wanted to check in and see how you are doing."
])

Model accuracy: 1.0


In [17]:
# Clean and vectorize new texts
new_texts = clean_text(new_texts)
vectorized_new_texts = vectorize_layer(new_texts)

# Predict with the model
predictions = model.predict(vectorized_new_texts)
print("Predictions:", predictions)

Predictions: [[0.5074568 ]
 [0.50704896]]
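
The sigmoid outputs are probabilities of the spam class, so they still need a decision threshold to become labels. The sketch below applies an assumed cut-off of 0.5 to the two values printed above; `THRESHOLD` and the variable names are illustrative choices, not part of the original notebook.

```python
predicted_probs = [[0.5074568], [0.50704896]]  # values copied from the output above
THRESHOLD = 0.5  # assumed cut-off: at or above it counts as spam

predicted_labels = [int(p[0] >= THRESHOLD) for p in predicted_probs]
margins = [abs(p[0] - THRESHOLD) for p in predicted_probs]

print(predicted_labels)                  # [1, 1]
print(all(m < 0.01 for m in margins))    # True
```

Both messages end up labeled as spam, including the ordinary greeting, and both probabilities sit within 0.01 of the threshold. With only six training messages and ten epochs, the model has not learned a useful decision boundary, despite the perfect accuracy reported on the training data.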
