# 🧠 Spam / Ham Message Classifier (Enhanced)

Welcome to your custom spam classifier notebook! Here we build, train, and export a machine learning model that distinguishes 'spam' (unwanted messages) from 'ham' (legitimate messages).

This enhanced version includes:
- **Advanced Preprocessing**: Stopword removal and stemming.
- **Data Analysis**: Justification for padding length.
- **Robust Evaluation**: A full classification report and confusion matrix.
- **Live Demo**: A cell to test predictions on new messages.

### 📦 1. Install Dependencies

First, we need to install the necessary Python libraries.
- `tensorflow`: The core machine learning library for building and training our model.
- `tensorflowjs`: A library to convert our trained TensorFlow model into a format that can run directly in a web browser.
- `nltk`: A library for natural language processing, used here for stopword removal and stemming.

The notebook also uses `pandas`, `scikit-learn`, `matplotlib`, and `seaborn`. These come preinstalled on Colab; add them to the install command if you run the notebook elsewhere.

In [None]:
!pip install tensorflow tensorflowjs nltk

### 📁 2. Load and Prepare the Dataset

Next, we'll load our dataset. We're using the 'SMSSpamCollection' dataset, which is a collection of SMS messages already labeled as either spam or ham.

**Action Required:** You'll need to upload the `SMSSpamCollection` file to your Colab environment. You can do this by clicking the **folder icon** on the left sidebar and then clicking the **upload button**.

In [None]:
import pandas as pd

# The dataset is a tab-separated file (.tsv), so we use sep='\t'.
# We also assign column names 'label' and 'message' for clarity.
df = pd.read_csv('SMSSpamCollection', sep='\t', header=None, names=['label', 'message'])

# Display the first 5 rows to verify it loaded correctly
df.head()

### 🧹 3. Enhanced Text Preprocessing

To improve accuracy, we'll perform more advanced text cleaning:
- **Tokenization**: Splitting sentences into words.
- **Stopword Removal**: Removing common words (like 'the', 'a', 'is') that don't add much meaning.
- **Stemming**: Reducing words to their root form (e.g., 'running' becomes 'run').

In [None]:
import nltk
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# Download necessary NLTK data
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('punkt_tab')  # word_tokenize in newer NLTK releases looks for this resource

stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))

def clean_text(text):
    text = text.lower() # Convert to lowercase
    text = re.sub(r'\W', ' ', text) # Remove non-word characters
    text = re.sub(r'\s+', ' ', text) # Replace multiple spaces with a single space
    text = text.strip()
    
    # Tokenization, Stopword Removal, and Stemming
    tokens = word_tokenize(text)
    filtered_tokens = [stemmer.stem(word) for word in tokens if word not in stop_words]
    
    return " ".join(filtered_tokens)

df['cleaned'] = df['message'].apply(clean_text)
df.head()
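
To see what the regular-expression steps in `clean_text` do before NLTK gets involved, the sketch below runs them on a made-up message. It uses only the standard library; the `normalize` helper and the sample text are illustrative assumptions, and stopword removal and stemming are deliberately left out.

```python
import re

def normalize(text):
    """Apply the same lowercase and regex steps as clean_text, without NLTK."""
    text = text.lower()
    text = re.sub(r'\W', ' ', text)   # every non-word character becomes a space
    text = re.sub(r'\s+', ' ', text)  # collapse runs of whitespace
    return text.strip()

sample = "WINNER!! Call 0800-123 now..."  # made-up example message
print(normalize(sample))  # winner call 0800 123 now
```

Digits survive this step, so phone numbers and prize amounts remain as tokens the model can learn from.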

### 📊 4. Analyze Message Length Distribution

To choose an optimal padding length (`maxlen`), it's helpful to understand the distribution of message lengths in our dataset. A fixed input length is required for our neural network, so we'll pad or truncate messages to fit. This plot helps us pick a value that covers most message lengths without adding excessive padding.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

message_lengths = df['cleaned'].apply(lambda x: len(x.split()))

plt.figure(figsize=(10, 6))
sns.histplot(message_lengths, bins=50, kde=True)
plt.title('Distribution of Message Lengths')
plt.xlabel('Number of Words')
plt.ylabel('Frequency')
plt.show()

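# Check the padding length against the data instead of assuming it.
# If this share is close to 100%, maxlen=50 is a reasonable choice.
share_within_50 = (message_lengths <= 50).mean()
print(f"{share_within_50:.1%} of cleaned messages have 50 words or fewer.")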
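
To turn the histogram into a number, one option is to pick the smallest length that covers a chosen share of the messages. The sketch below does this with the standard library only. The `smallest_maxlen` helper and the toy `lengths` list are assumptions for illustration; in the notebook you could pass `message_lengths.tolist()` instead.

```python
def smallest_maxlen(lengths, target=0.99):
    """Smallest word count that covers at least `target` of the messages."""
    ordered = sorted(lengths)
    for rank, n_words in enumerate(ordered, start=1):
        # rank / len(ordered) is the share of messages with length <= n_words
        if rank / len(ordered) >= target:
            return n_words
    return ordered[-1]

lengths = [3, 5, 5, 8, 9, 12, 14, 20, 31, 70]  # toy word counts
print(smallest_maxlen(lengths, target=0.9))    # 31
print(smallest_maxlen(lengths, target=1.0))    # 70
```

A target slightly below 100% keeps a few unusually long messages from inflating the padding for everything else.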

### 🔢 5. Tokenize and Pad Sequences

Now we convert our preprocessed text into numerical sequences and ensure they all have the same length (50), as determined from our analysis above.

In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

MAX_VOCAB_SIZE = 5000
MAX_LEN = 50

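# Keep only the most frequent words. With num_words=5000, Keras uses indices 1-4999
# (index 0 is reserved for padding), so the 4999 most frequent words are kept.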
tokenizer = Tokenizer(num_words=MAX_VOCAB_SIZE)
tokenizer.fit_on_texts(df['cleaned'])

# Convert text to sequences of integers
X = tokenizer.texts_to_sequences(df['cleaned'])

# Pad or truncate every sequence to MAX_LEN (Keras does both at the front by default)
X = pad_sequences(X, maxlen=MAX_LEN)
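
For reference, here is a plain-Python sketch of what the two Keras calls above do to a single, already-cleaned message, assuming the defaults used in this notebook (no `oov_token`, `padding='pre'`, `truncating='pre'`). The `encode` helper and the tiny `word_index` are hypothetical; the sketch can serve as a specification when the same steps have to be re-implemented outside Python.

```python
def encode(cleaned_text, word_index, num_words, maxlen):
    """Mimic texts_to_sequences + pad_sequences for one message."""
    ids = []
    for word in cleaned_text.split():
        idx = word_index.get(word)
        # Unknown words and words ranked at or beyond num_words are dropped
        if idx is not None and idx < num_words:
            ids.append(idx)
    ids = ids[-maxlen:]                     # 'pre' truncation keeps the tail
    return [0] * (maxlen - len(ids)) + ids  # 'pre' padding adds zeros in front

word_index = {"free": 1, "call": 2, "win": 3, "prize": 4}  # toy vocabulary
print(encode("win free prize unknown call", word_index, num_words=4, maxlen=6))
# [0, 0, 0, 3, 1, 2]
```

Here "prize" is dropped because its index equals `num_words`, and "unknown" is dropped because it is not in the vocabulary at all.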

### 🎯 6. Encode Labels

We convert our 'ham' and 'spam' labels into numerical format: `ham = 0` and `spam = 1`.

In [None]:
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
y = encoder.fit_transform(df['label'])

print(f"Original labels: {df['label'].unique()}")
print(f"Encoded labels: {encoder.transform(df['label'].unique())}")
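
`LabelEncoder` assigns integers to classes in sorted order, which is why 'ham' maps to 0 and 'spam' to 1. A minimal standard-library sketch of that mapping, using a made-up label list:

```python
labels = ["ham", "spam", "ham", "ham", "spam"]  # made-up labels

classes = sorted(set(labels))  # alphabetical order: ['ham', 'spam']
to_int = {name: i for i, name in enumerate(classes)}
encoded = [to_int[name] for name in labels]

print(to_int)   # {'ham': 0, 'spam': 1}
print(encoded)  # [0, 1, 0, 0, 1]
```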

### Split the Dataset into Training and Testing Sets

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

print(f"Training data shape: {X_train.shape}")
print(f"Testing data shape: {X_test.shape}")
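
The `stratify=y` argument keeps the spam-to-ham ratio the same in both splits, which matters because spam is the minority class. The sketch below illustrates the idea with the standard library only. `stratified_split` is a hypothetical helper, not scikit-learn's implementation, and the labels are synthetic.

```python
import random

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx), splitting each class in the same proportion."""
    rng = random.Random(seed)
    by_class = {}
    for i, label in enumerate(labels):
        by_class.setdefault(label, []).append(i)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

labels = [1] * 20 + [0] * 80  # synthetic data: 20% spam, 80% ham
train_idx, test_idx = stratified_split(labels)
spam_share_test = sum(labels[i] for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx), spam_share_test)  # 80 20 0.2
```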

### 🧠 7. Build and Train the Model

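We'll build our neural network and train it on the prepared training data. The test split is passed as `validation_data` so we can watch loss and accuracy after each epoch on data the model is not trained on. Keep in mind that if you tune the model based on these numbers, the test set is no longer a truly independent check; carving a separate validation split out of the training data avoids this.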

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, GlobalAveragePooling1D, Dense, Dropout

model = Sequential([
    # Note: `input_length` triggers a deprecation warning in Keras 3 and can be omitted there
    Embedding(input_dim=MAX_VOCAB_SIZE, output_dim=16, input_length=MAX_LEN),
    GlobalAveragePooling1D(),
    Dense(16, activation='relu'),
    Dropout(0.1), # Add dropout for regularization
    Dense(1, activation='sigmoid')
])

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

history = model.fit(X_train, y_train, epochs=10, validation_data=(X_test, y_test), batch_size=32)
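
As a quick check on `model.summary()`, the trainable parameter count can be worked out by hand from the layer sizes above. The arithmetic below assumes the standard layouts: an embedding table of `input_dim × output_dim` weights, and dense layers with one weight per input-output pair plus one bias per unit. The pooling and dropout layers have no parameters.

```python
MAX_VOCAB_SIZE = 5000
EMBED_DIM = 16  # output_dim of the Embedding layer
HIDDEN = 16     # units in the first Dense layer

embedding = MAX_VOCAB_SIZE * EMBED_DIM      # lookup table, no bias
dense_hidden = EMBED_DIM * HIDDEN + HIDDEN  # weights + biases
dense_out = HIDDEN * 1 + 1                  # weights + bias of the sigmoid unit
total = embedding + dense_hidden + dense_out

print(embedding, dense_hidden, dense_out, total)  # 80000 272 17 80289
```

Almost all of the parameters sit in the embedding table, so vocabulary size is the main lever on model size for the browser export.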

### 📈 8. Evaluate the Model

After training, we perform a final evaluation on our test set. We'll look at:
- **Loss & Accuracy**: Basic performance metrics.
- **Classification Report**: Detailed metrics like precision, recall, and F1-score for each class.
- **Confusion Matrix**: A table showing how many predictions were correct and incorrect for each class.

In [None]:
from sklearn.metrics import classification_report, confusion_matrix
import numpy as np

loss, accuracy = model.evaluate(X_test, y_test)
print(f"Test Accuracy: {accuracy:.4f}")

# Get predictions
y_pred_prob = model.predict(X_test)
y_pred = (y_pred_prob > 0.5).astype(int).ravel()  # threshold at 0.5 and flatten (n, 1) to 1-D labels

print("\n--- Classification Report ---")
print(classification_report(y_test, y_pred, target_names=['Ham', 'Spam']))

print("\n--- Confusion Matrix ---")
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['Ham', 'Spam'], yticklabels=['Ham', 'Spam'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()
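
The numbers in the classification report follow directly from the confusion matrix. The sketch below recomputes precision, recall, and F1 for the spam class from four counts. The counts are invented for illustration and are not results from this model.

```python
def spam_metrics(tn, fp, fn, tp):
    """Precision, recall, and F1 for the positive (spam) class."""
    precision = tp / (tp + fp)  # of messages flagged as spam, how many were spam
    recall = tp / (tp + fn)     # of actual spam messages, how many were caught
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Invented counts, laid out like sklearn's matrix: [[tn, fp], [fn, tp]]
precision, recall, f1 = spam_metrics(tn=90, fp=2, fn=4, tp=16)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

For a spam filter, low precision means legitimate messages get blocked, while low recall means spam gets through, so it is worth reading both rather than accuracy alone.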

### 🚀 9. Spam/Ham Prediction Demo

Let's test our trained model on some new, unseen messages. This function reapplies the same preprocessing pipeline used for training (cleaning, tokenizing, padding) and feeds the result to the model for a prediction.

In [None]:
def predict_message(msg):
    # Preprocess the message using the same steps as training
    cleaned_msg = clean_text(msg)
    seq = tokenizer.texts_to_sequences([cleaned_msg])
    padded = pad_sequences(seq, maxlen=MAX_LEN)
    
    # Get the prediction (the sigmoid output is the estimated probability of spam)
    pred_prob = model.predict(padded)[0][0]
    prediction = "Spam 🚫" if pred_prob > 0.5 else "Ham ✅"
    
    return f"Prediction: {prediction} (Spam probability: {pred_prob:.2f})"

# Test with a spam message
spam_message = "Congratulations! you have won a free lottery ticket worth $1000. call 12345 to claim now"
print(f"Message: '{spam_message}'")
print(predict_message(spam_message))

# Test with a ham message
ham_message = "Hi, can we meet tomorrow for the project discussion?"
print(f"\nMessage: '{ham_message}'")
print(predict_message(ham_message))

### 🌐 10. Export for TensorFlow.js

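Finally, we'll save our trained model and the tokenizer's word index. Together with a JavaScript port of the preprocessing steps (cleaning, stopword removal, stemming, index lookup, and front-padding to length 50), these files are what you need to run the spam classifier on your website.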

In [None]:
import tensorflowjs as tfjs
import json

# Save the model in TensorFlow.js format
tfjs.converters.save_keras_model(model, 'model')

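# Save the tokenizer's word index, restricted to the indices the model was trained on.
# tokenizer.word_index contains every word seen during fitting; only indices below
# MAX_VOCAB_SIZE are valid inputs for the Embedding layer.
word_index = {word: idx for word, idx in tokenizer.word_index.items() if idx < MAX_VOCAB_SIZE}
with open('word_index.json', 'w') as f:
    json.dump(word_index, f)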

print("✅ Model and tokenizer saved for web deployment.")
print("\nDon't forget to download the 'model' directory and the 'word_index.json' file!")