# SMS Spam Detection Model - Complete Workflow

This notebook demonstrates the complete workflow for creating, training, and integrating an SMS spam detection model with the Flask backend.

## What you'll learn:
1. How to create and preprocess a dataset
2. How to train multiple ML models and compare them
3. How to save the trained model for use in Flask
4. How the Flask backend loads and uses the model
5. How to test the integration

In [None]:
# Install required packages (run this cell first)
# %pip installs into the environment of the running kernel
%pip install pandas numpy scikit-learn matplotlib seaborn joblib

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
import joblib
import os
from datetime import datetime
import json

# Machine Learning libraries
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

print("✅ All libraries imported successfully!")

## Step 1: Create Sample Dataset

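In a real project, you would use a dataset like the SMS Spam Collection. For this demo, we'll build a small sample dataset by repeating 15 spam and 15 ham messages 15 times each (450 rows in total). Because every message is duplicated, copies of the same text end up in both the training and test sets, so the accuracy figures later in the notebook are optimistic and only illustrate the workflow.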

In [None]:
# Create sample spam and ham messages
spam_messages = [
    "FREE! Win a £1000 cash prize! Text WIN to 12345 now!",
    "URGENT! Your account will be closed. Click here immediately!",
    "Congratulations! You've won a lottery! Call now to claim your prize!",
    "Limited time offer! Get 50% off on all products. Buy now!",
    "WINNER! You have been selected for a special reward. Claim now!",
    "Free entry to win a brand new iPhone! Text IPHONE to 54321",
    "Your loan has been approved! Get cash now with no credit check!",
    "STOP! You owe money. Pay now or face legal action!",
    "Amazing deal! Buy one get one free! Limited time only!",
    "You have won $10000! Click this link to claim your money!",
    "Free ringtones! Text RING to 12345 for unlimited downloads!",
    "Urgent: Your bank account has been compromised. Verify now!",
    "Get rich quick! Make $5000 per week working from home!",
    "Final notice: Your subscription will expire. Renew now!",
    "Exclusive offer just for you! 90% discount on luxury items!"
] * 15  # Repeat the list 15 times (225 rows, all exact duplicates)

ham_messages = [
    "Hi, how are you doing today?",
    "Can you pick up some milk on your way home?",
    "Thanks for the great dinner last night!",
    "Meeting is scheduled for 3 PM tomorrow",
    "Happy birthday! Hope you have a wonderful day!",
    "Don't forget about the doctor's appointment",
    "The weather is really nice today, isn't it?",
    "I'll be running a bit late for our meeting",
    "Could you send me the report when you get a chance?",
    "Let's grab lunch together this weekend",
    "The project deadline has been extended to next week",
    "I really enjoyed the movie we watched yesterday",
    "Please call me when you get this message",
    "The package should arrive by Friday",
    "Looking forward to seeing you at the party!"
] * 15  # Repeat the list 15 times (225 rows, all exact duplicates)

# Create DataFrame
messages = spam_messages + ham_messages
labels = ['spam'] * len(spam_messages) + ['ham'] * len(ham_messages)

df = pd.DataFrame({
    'message': messages,
    'label': labels
})

# Shuffle the dataset
df = df.sample(frac=1, random_state=42).reset_index(drop=True)

print(f"Dataset created with {len(df)} samples")
print(f"Spam messages: {len(df[df['label'] == 'spam'])}")
print(f"Ham messages: {len(df[df['label'] == 'ham'])}")
print("\nFirst few samples:")
print(df.head())
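
For a real dataset such as the SMS Spam Collection mentioned above, loading could look like the following sketch. It assumes a local tab-separated file with one `label<TAB>message` pair per line and no header row; the function name `load_sms_dataset` and any file path passed to it are hypothetical.

```python
import csv

import pandas as pd


def load_sms_dataset(path):
    """Load a tab-separated 'label<TAB>message' file into the notebook's df layout.

    Assumption: no header row, labels are the strings 'ham' and 'spam'.
    """
    df = pd.read_csv(
        path,
        sep="\t",
        header=None,
        names=["label", "message"],
        quoting=csv.QUOTE_NONE,  # treat quote characters inside messages as plain text
        encoding="utf-8",
    )
    # Drop exact duplicate messages so copies cannot land in both train and test
    return df.drop_duplicates(subset="message").reset_index(drop=True)
```

The returned frame has the same `message` and `label` columns as the sample `df`, so the remaining steps should apply unchanged.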

## Step 2: Data Preprocessing

Clean and prepare the text data for machine learning.

In [None]:
def preprocess_text(text):
    """Preprocess SMS text for machine learning"""
    if pd.isna(text):
        return ""
    
    # Convert to lowercase (str() guards against non-string values such as numbers)
    text = str(text).lower()
    
    # Replace everything except letters, digits, and whitespace with a space
    text = re.sub(r'[^a-z0-9\s]', ' ', text)
    
    # Remove extra whitespace
    text = ' '.join(text.split())
    
    return text

# Apply preprocessing
df['processed_message'] = df['message'].apply(preprocess_text)

# Convert labels to binary (0 for ham, 1 for spam)
df['label_binary'] = df['label'].map({'ham': 0, 'spam': 1})

print("Preprocessing completed!")
print("\nExample of preprocessing:")
for i in range(3):
    print(f"Original: {df['message'].iloc[i]}")
    print(f"Processed: {df['processed_message'].iloc[i]}")
    print(f"Label: {df['label'].iloc[i]}\n")
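
To make the cleaning steps concrete, here is a standalone trace of one message through the same three operations. The helper `clean` is a hypothetical stand-in that repeats the logic of `preprocess_text` without the pandas dependency.

```python
import re


def clean(text):
    """Same three steps as preprocess_text: lowercase, strip symbols, squeeze spaces."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # punctuation and symbols such as '£' become spaces
    return " ".join(text.split())


raw = "FREE! Win a £1000 cash prize! Text WIN to 12345 now!"
cleaned = clean(raw)
print(cleaned)  # free win a 1000 cash prize text win to 12345 now
```

Note that `£` and `!` are discarded here, even though such symbols may carry useful signal for spam detection.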

## Step 3: Train Multiple Models

We'll train three classifiers (Naive Bayes, Logistic Regression, and Random Forest) on TF-IDF features and compare their test accuracy.

In [None]:
# Split the data
X = df['processed_message']
y = df['label_binary']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training set size: {len(X_train)}")
print(f"Test set size: {len(X_test)}")
print(f"Training set spam ratio: {y_train.mean():.3f}")
print(f"Test set spam ratio: {y_test.mean():.3f}")

# Create TF-IDF vectorizer
vectorizer = TfidfVectorizer(
    max_features=5000,
    stop_words='english',
    ngram_range=(1, 2),
    min_df=2,
    max_df=0.95
)

# Fit and transform the training data
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

print(f"\nTF-IDF matrix shape: {X_train_tfidf.shape}")
print(f"Vocabulary size: {len(vectorizer.vocabulary_)}")
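
Because each sample message is repeated 15 times, a random split almost certainly places copies of the same text in both sets, so test accuracy mostly measures memorization. A rough way to quantify this, sketched below with a hypothetical helper `overlap_fraction`, is to count how many test messages also occur verbatim in the training set.

```python
def overlap_fraction(train_texts, test_texts):
    """Return the fraction of test messages that also appear verbatim in training."""
    train_set = set(train_texts)
    test_list = list(test_texts)
    if not test_list:
        return 0.0
    shared = sum(1 for text in test_list if text in train_set)
    return shared / len(test_list)


# Toy illustration: two of the three test messages were seen during training
train_demo = ["free prize now", "see you at lunch", "free prize now"]
test_demo = ["free prize now", "see you at lunch", "call me later"]
print(f"Overlap: {overlap_fraction(train_demo, test_demo):.2f}")
```

In the notebook, `overlap_fraction(X_train, X_test)` would give the same measure; a value near 1.0 suggests deduplicating the data before splitting.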

In [None]:
# Train multiple models
models = {
    'Naive Bayes': MultinomialNB(),
    'Logistic Regression': LogisticRegression(random_state=42, max_iter=1000),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42)
}

model_results = {}

for name, model in models.items():
    print(f"\nTraining {name}...")
    
    # Train the model
    model.fit(X_train_tfidf, y_train)
    
    # Make predictions
    y_pred = model.predict(X_test_tfidf)
    
    # Calculate accuracy
    accuracy = accuracy_score(y_test, y_pred)
    
    # Store results
    model_results[name] = {
        'model': model,
        'accuracy': accuracy,
        'predictions': y_pred
    }
    
    print(f"{name} Accuracy: {accuracy:.4f}")

# Find the best model
best_model_name = max(model_results, key=lambda k: model_results[k]['accuracy'])  # ties keep the first model
best_model = model_results[best_model_name]['model']
best_accuracy = model_results[best_model_name]['accuracy']

print(f"\n🏆 Best Model: {best_model_name} with accuracy: {best_accuracy:.4f}")
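
Accuracy alone can hide how a model treats each class; for spam filtering, per-class precision and recall are often more telling. The sketch below uses `classification_report` (imported earlier but not otherwise used) on a tiny made-up dataset. The messages and the choice of `MultinomialNB` are illustrative assumptions, not the notebook's results.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.naive_bayes import MultinomialNB

# Tiny illustrative dataset (1 = spam, 0 = ham)
train_texts = ["win cash prize now", "claim your free reward", "lunch at noon", "see you tomorrow"]
train_labels = [1, 1, 0, 0]
eval_texts = ["free cash reward", "see you at lunch"]
eval_labels = [1, 0]

vec = TfidfVectorizer()
clf = MultinomialNB().fit(vec.fit_transform(train_texts), train_labels)
eval_pred = clf.predict(vec.transform(eval_texts))

# Per-class precision, recall, and F1 instead of a single accuracy number
report = classification_report(
    eval_labels, eval_pred, target_names=["ham", "spam"], zero_division=0
)
print(report)
```

Inside the training loop, the equivalent call would be `classification_report(y_test, y_pred, target_names=['ham', 'spam'])`.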

## Step 4: Visualize Results

In [None]:
# Plot model comparison
model_names = list(model_results.keys())
accuracies = [model_results[name]['accuracy'] for name in model_names]

plt.figure(figsize=(10, 6))
bars = plt.bar(model_names, accuracies, color=['skyblue', 'lightgreen', 'lightcoral'])
plt.title('Model Accuracy Comparison', fontsize=16, fontweight='bold')
plt.xlabel('Models', fontsize=12)
plt.ylabel('Accuracy', fontsize=12)
plt.ylim(0, 1.1)  # full range: low scores stay visible and labels above 1.0 fit

# Add accuracy values on bars
for bar, acc in zip(bars, accuracies):
    plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.005, 
             f'{acc:.4f}', ha='center', va='bottom', fontweight='bold')

plt.tight_layout()
plt.show()

# Confusion matrix for the best model
best_predictions = model_results[best_model_name]['predictions']
cm = confusion_matrix(y_test, best_predictions)

plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=['Ham', 'Spam'], yticklabels=['Ham', 'Spam'])
plt.title(f'Confusion Matrix - {best_model_name}', fontsize=16, fontweight='bold')
plt.xlabel('Predicted', fontsize=12)
plt.ylabel('Actual', fontsize=12)
plt.show()

## Step 5: Save Model for Flask Backend

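This is the crucial step: saving both the model and the fitted vectorizer so the Flask backend can apply exactly the same feature transformation at prediction time.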

In [None]:
# Create models directory
models_dir = '../backend/ml_model/models'
os.makedirs(models_dir, exist_ok=True)

# Save the best model and vectorizer
model_path = os.path.join(models_dir, 'spam_model.pkl')
vectorizer_path = os.path.join(models_dir, 'vectorizer.pkl')

joblib.dump(best_model, model_path)
joblib.dump(vectorizer, vectorizer_path)

print(f"✅ Model saved to: {model_path}")
print(f"✅ Vectorizer saved to: {vectorizer_path}")

# Save model metadata
metadata = {
    'model_type': best_model_name,
    'accuracy': float(best_accuracy),
    'training_date': datetime.now().isoformat(),
    'features': int(X_train_tfidf.shape[1]),
    'training_samples': int(len(X_train)),
    'test_samples': int(len(X_test)),
    'vectorizer_params': vectorizer.get_params()
}

metadata_path = os.path.join(models_dir, 'model_metadata.json')
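with open(metadata_path, 'w') as f:
    # default=str handles values json cannot serialize (e.g. the vectorizer's dtype class)
    json.dump(metadata, f, indent=2, default=str)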

print(f"✅ Metadata saved to: {metadata_path}")
print("\n🎉 Model is ready for Flask backend!")
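
Saving the model and vectorizer as two files works, but they must always be loaded as a matching pair. One alternative, sketched here as an assumption rather than what the backend in this project expects, is to wrap both in a scikit-learn `Pipeline` and persist a single file. The file name `spam_pipeline.pkl` and the toy training data are hypothetical.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy training data (1 = spam, 0 = ham)
texts = ["win cash prize now", "claim your free reward", "lunch at noon", "see you tomorrow"]
labels = [1, 1, 0, 0]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", MultinomialNB()),
])
pipeline.fit(texts, labels)  # raw text in, the pipeline vectorizes internally

# Persist vectorizer and classifier together, then reload and compare
with tempfile.TemporaryDirectory() as tmp_dir:
    pipeline_path = os.path.join(tmp_dir, "spam_pipeline.pkl")
    joblib.dump(pipeline, pipeline_path)
    restored = joblib.load(pipeline_path)

sample = ["free cash reward", "see you at lunch"]
same_output = list(pipeline.predict(sample)) == list(restored.predict(sample))
print(f"Restored pipeline matches original: {same_output}")
```

As with any pickle-based format, such files should only be loaded from trusted sources and with a compatible scikit-learn version.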

## Step 6: Test the Saved Model

Let's verify that our saved model works correctly.

In [None]:
# Load the saved model and test it
loaded_model = joblib.load(model_path)
loaded_vectorizer = joblib.load(vectorizer_path)

# Test with sample messages
test_messages = [
    "Hi, how are you doing today?",
    "FREE! Win a £1000 cash prize! Text WIN to 12345 now!",
    "Can you pick up some milk on your way home?",
    "URGENT! Your account will be closed. Click here immediately!",
    "Thanks for the great dinner last night!",
    "Congratulations! You've won a lottery prize!",
    "Meeting is scheduled for 3 PM tomorrow"
]

print("Testing the saved model:")
print("=" * 60)

for message in test_messages:
    # Preprocess the message
    processed = preprocess_text(message)
    
    # Vectorize
    features = loaded_vectorizer.transform([processed])
    
    # Predict
    prediction = loaded_model.predict(features)[0]
    probability = loaded_model.predict_proba(features)[0]
    confidence = max(probability)
    
    label = '🚨 SPAM' if prediction == 1 else '✅ HAM'
    
    print(f"Message: {message}")
    print(f"Prediction: {label} (Confidence: {confidence:.3f})")
    print("-" * 60)
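
`max(probability)` above is the confidence in whichever class was predicted. When the probability of spam specifically is needed, for example to apply a custom threshold, the right column can be looked up through the model's `classes_` attribute. The following sketch uses a toy `LogisticRegression`; the helper `spam_probability` and the 0.8 threshold are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

texts = ["win cash prize now", "claim your free reward", "lunch at noon", "see you tomorrow"]
labels = [1, 1, 0, 0]  # 1 = spam, 0 = ham

vec = TfidfVectorizer()
model = LogisticRegression(max_iter=1000).fit(vec.fit_transform(texts), labels)


def spam_probability(model, features):
    """Return P(spam) for each row, locating the spam column via classes_."""
    spam_col = int(np.where(model.classes_ == 1)[0][0])
    return model.predict_proba(features)[:, spam_col]


probs = spam_probability(model, vec.transform(["free cash prize", "see you at noon"]))
threshold = 0.8  # hypothetical: flag only high-confidence spam
for p in probs:
    print(f"P(spam) = {p:.3f} -> {'flag' if p >= threshold else 'allow'}")
```

Raising the threshold trades missed spam for fewer legitimate messages flagged, which is often the safer direction for an SMS filter.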

## Step 7: How Flask Uses This Model

The cell below simulates how the Flask backend loads and uses your trained model:

In [None]:
# This is similar to what happens in the Flask backend
print("Flask Backend Integration Example:")
print("=" * 50)

# Simulate the SpamDetector class from Flask
class FlaskSpamDetector:
    def __init__(self, model_path, vectorizer_path):
        self.model = joblib.load(model_path)
        self.vectorizer = joblib.load(vectorizer_path)
        print("✅ Model loaded in Flask backend")
    
    def predict(self, message):
        # Preprocess (the backend must apply the same cleaning used in training)
        processed = preprocess_text(message)
        
        # Vectorize
        features = self.vectorizer.transform([processed])
        
        # Predict
        prediction = self.model.predict(features)[0]
        probabilities = self.model.predict_proba(features)[0]
        confidence = max(probabilities)
        
        return {
            'prediction': 'spam' if prediction == 1 else 'ham',
            'confidence': float(confidence)
        }

# Create detector instance (like Flask does)
flask_detector = FlaskSpamDetector(model_path, vectorizer_path)

# Test prediction (like API endpoint does)
test_message = "FREE! Win money now! Click here!"
result = flask_detector.predict(test_message)

print("\nAPI Request: POST /api/predict")
print(f"Message: {test_message}")
print(f"API Response: {result}")
print("\n✅ This mirrors how the Flask backend loads the model and serves predictions.")
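
For reference, a route wrapping such a detector might look like the sketch below. The actual backend code is not shown in this notebook, so everything here is an assumption: the request and response JSON shapes, the error handling, and `StubDetector`, which stands in for `FlaskSpamDetector` so the snippet runs without saved model files. Only the `/api/predict` path is taken from the example above.

```python
from flask import Flask, jsonify, request


class StubDetector:
    """Hypothetical stand-in for FlaskSpamDetector; swap in the real detector."""

    def predict(self, message):
        is_spam = "free" in message.lower()  # placeholder rule, not a trained model
        return {"prediction": "spam" if is_spam else "ham", "confidence": 1.0}


app = Flask(__name__)
detector = StubDetector()  # created once at startup, not on every request


@app.route("/api/predict", methods=["POST"])
def predict():
    payload = request.get_json(silent=True)
    message = payload.get("message") if isinstance(payload, dict) else None
    if not isinstance(message, str) or not message.strip():
        return jsonify({"error": "Field 'message' is required"}), 400
    return jsonify(detector.predict(message))


# Exercise the route with Flask's test client, without starting a server
client = app.test_client()
ok_response = client.post("/api/predict", json={"message": "FREE! Win money now!"})
bad_response = client.post("/api/predict", json={})
print(ok_response.get_json(), bad_response.status_code)
```

Loading the model once at import time, rather than inside the route, avoids re-reading the pickle files on every request.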

## Summary

🎉 **Congratulations!** You've successfully:

1. ✅ Created and preprocessed a dataset
2. ✅ Trained multiple machine learning models
3. ✅ Selected the best performing model
4. ✅ Saved the model for Flask integration
5. ✅ Tested the saved model
6. ✅ Understood how Flask will use the model

### Next Steps:

1. **Start your Flask backend:**
   ```bash
   cd ../backend
   python run.py
   ```

2. **Test the API:**
   ```bash
   python test_api.py
   ```

3. **Connect your React frontend** to the Flask backend by updating the API base URL

### Model Performance:
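- **Best Model:** printed at the end of Step 3 (`best_model_name`)
- **Accuracy:** printed alongside it (`best_accuracy`); on this duplicated sample data the figure is optimistic
- **Features:** the second dimension of the TF-IDF matrix shape printed in Step 3 (`X_train_tfidf.shape[1]`)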

Your SMS Guard application now has a working machine learning backend! Before relying on its predictions, retrain the model on a real dataset such as the SMS Spam Collection. 🚀