# SMS Spam Detection Model Training

This notebook demonstrates how to:
1. Load and explore the SMS spam dataset
2. Preprocess the text data
3. Train a machine learning model
4. Evaluate the model performance
5. Save the model for use in the Flask backend

## Dataset
We'll use the SMS Spam Collection dataset, which contains SMS messages labeled as 'spam' or 'ham' (legitimate).
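
As a minimal sketch of loading the real data instead of the sample data used later in this notebook: the helper name `load_sms_dataset`, the column names `v1`/`v2`, and the `latin-1` encoding are assumptions about the Kaggle CSV export, so check them against the file you actually download.

```python
import pandas as pd


def load_sms_dataset(source, label_col="v1", text_col="v2", encoding="latin-1"):
    """Read an SMS spam CSV and return a DataFrame with 'label' and 'message' columns.

    Assumption: the file stores the label in `label_col` and the text in
    `text_col` (believed to be 'v1' and 'v2' in the Kaggle export).
    """
    raw = pd.read_csv(source, encoding=encoding, usecols=[label_col, text_col])
    df = raw.rename(columns={label_col: "label", text_col: "message"})

    # Normalise labels and drop rows that cannot be used for training
    df["label"] = df["label"].str.strip().str.lower()
    df = df[df["label"].isin(["ham", "spam"])].dropna(subset=["message"])

    return df[["label", "message"]].reset_index(drop=True)
```

Calling `load_sms_dataset` with the path to the downloaded file would then take the place of `pd.DataFrame(sample_data)` in the loading cell.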

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
import joblib
import os
from datetime import datetime

# Machine Learning libraries
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.pipeline import Pipeline

# Text processing
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# Download NLTK data (run once)
# nltk.download('stopwords')

print("Libraries imported successfully!")

## 1. Data Loading and Exploration

In [None]:
# Load the SMS Spam Collection dataset
# You can download it from: https://www.kaggle.com/uciml/sms-spam-collection-dataset
# Or use the sample data below

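# Sample data for demonstration (replace with the actual dataset).
# The 8 messages are repeated 100 times, so identical texts land in both the
# training and test sets and the scores on this sample are not meaningful.
sample_data = {
    # Labels line up one-to-one with the messages below (message 7 is a promotion, i.e. spam)
    'label': ['ham', 'spam', 'ham', 'spam', 'ham', 'spam', 'spam', 'spam'] * 100,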
    'message': [
        'Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...',
        'Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&Cs apply 08452810075over18\'s',
        'U dun say so early hor... U c already then say...',
        'FreeMsg Hey there darling it\'s been 3 week\'s now and no word back! I\'d like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv',
        'Even my brother is not like to speak with me. They treat me like aids patent.',
        'WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only.',
        'Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with camera for Free! Call The Mobile Update Co FREE on 08002986030',
        'SIX chances to win CASH! From 100 to 20,000 pounds txt> CSH11 and send to 87575. Cost 150p/day, 6days, 16+ TsandCs apply Reply HL 4 info'
    ] * 100
}

# Create DataFrame
df = pd.DataFrame(sample_data)

# Display basic information
print("Dataset shape:", df.shape)
print("\nFirst few rows:")
print(df.head())

print("\nLabel distribution:")
print(df['label'].value_counts())

print("\nDataset info:")
print(df.info())

## 2. Data Preprocessing

In [None]:
def preprocess_text(text):
    """
    Preprocess SMS text for machine learning
    """
    if pd.isna(text):
        return ""
    
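    # Convert to lowercase (str() guards against non-string cells such as numbers)
    text = str(text).lower()
    
    # Replace digits, punctuation, and symbols (e.g. £) with spaces, keeping only letters
    text = re.sub(r'[^a-z\s]', ' ', text)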
    
    # Remove extra whitespace
    text = ' '.join(text.split())
    
    return text

# Apply preprocessing
df['processed_message'] = df['message'].apply(preprocess_text)

# Convert labels to binary (0 for ham, 1 for spam)
df['label_binary'] = df['label'].map({'ham': 0, 'spam': 1})

print("Preprocessing completed!")
print("\nSample processed messages:")
for i in range(3):
    print(f"Original: {df['message'].iloc[i][:100]}...")
    print(f"Processed: {df['processed_message'].iloc[i][:100]}...")
    print(f"Label: {df['label'].iloc[i]}\n")

## 3. Feature Extraction and Model Training

In [None]:
# Split the data
X = df['processed_message']
y = df['label_binary']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training set size: {len(X_train)}")
print(f"Test set size: {len(X_test)}")
print(f"Training set spam ratio: {y_train.mean():.3f}")
print(f"Test set spam ratio: {y_test.mean():.3f}")
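
One caveat when working with the repeated sample data: because every message is copied many times, a random split places the same text in both sets. The sketch below uses a toy frame with made-up messages to measure that overlap, and shows `drop_duplicates` as one possible way to remove it before splitting.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 4 unique messages, each repeated 50 times
toy = pd.DataFrame({
    "message": ["see you at lunch", "win a free prize now",
                "call me later", "claim your cash reward"] * 50,
    "label": [0, 1, 0, 1] * 50,
})

train, test = train_test_split(
    toy, test_size=0.2, random_state=42, stratify=toy["label"]
)

# Fraction of test messages whose exact text also occurs in the training set
overlap = test["message"].isin(train["message"].unique()).mean()
print(f"Test messages also seen in training: {overlap:.0%}")

# Dropping exact duplicates before splitting removes this kind of leakage
unique_rows = toy.drop_duplicates(subset="message")
print(f"Rows before: {len(toy)}, after de-duplication: {len(unique_rows)}")
```

With the full dataset the overlap is far smaller, but checking it is cheap and keeps the test score honest.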

In [None]:
# Create TF-IDF vectorizer
vectorizer = TfidfVectorizer(
    max_features=5000,  # Limit vocabulary size
    stop_words='english',  # Remove common English stop words
    ngram_range=(1, 2),  # Use unigrams and bigrams
    min_df=2,  # Ignore terms that appear in fewer than 2 documents
    max_df=0.95  # Ignore terms that appear in more than 95% of documents
)

# Fit and transform the training data
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

print(f"TF-IDF matrix shape: {X_train_tfidf.shape}")
print(f"Vocabulary size: {len(vectorizer.vocabulary_)}")

In [None]:
# Train multiple models and compare performance
models = {
    'Naive Bayes': MultinomialNB(),
    'Logistic Regression': LogisticRegression(random_state=42, max_iter=1000),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42)
}

model_results = {}

for name, model in models.items():
    print(f"\nTraining {name}...")
    
    # Train the model
    model.fit(X_train_tfidf, y_train)
    
    # Make predictions
    y_pred = model.predict(X_test_tfidf)
    
    # Calculate accuracy
    accuracy = accuracy_score(y_test, y_pred)
    
    # Store results
    model_results[name] = {
        'model': model,
        'accuracy': accuracy,
        'predictions': y_pred
    }
    
    print(f"{name} Accuracy: {accuracy:.4f}")
    print(f"Classification Report for {name}:")
    print(classification_report(y_test, y_pred, target_names=['Ham', 'Spam']))

# Find the best model
best_model_name = max(model_results.keys(), key=lambda k: model_results[k]['accuracy'])
best_model = model_results[best_model_name]['model']
best_accuracy = model_results[best_model_name]['accuracy']

print(f"\nBest Model: {best_model_name} with accuracy: {best_accuracy:.4f}")
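
A single held-out split can give a noisy estimate, and keeping the vectorizer separate from the classifier makes it easy to mismatch the two at prediction time. As a hedged sketch (the toy messages are made up, and this is an alternative to the cells above rather than what this notebook does), `Pipeline` and `cross_val_score` can wrap both steps so that the TF-IDF vocabulary is refit inside each fold:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy corpus (hypothetical messages); 1 = spam, 0 = ham
texts = [
    "win a free prize now", "free cash reward claim today",
    "urgent claim your free entry", "you have won call now",
    "free tickets text win", "claim prize call this number",
    "are you coming home tonight", "see you at lunch tomorrow",
    "can you call me later", "thanks for dinner last night",
    "running late see you soon", "did you finish the report",
]
labels = [1] * 6 + [0] * 6

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("clf", MultinomialNB()),
])

# With an integer cv and a classifier, scikit-learn uses stratified folds
scores = cross_val_score(pipeline, texts, labels, cv=3, scoring="accuracy")
print("Fold accuracies:", scores)
print(f"Mean accuracy: {scores.mean():.3f}")

# Fitting on all data yields one object that accepts raw strings directly
pipeline.fit(texts, labels)
print(pipeline.predict(["free prize waiting for you"]))
```

A fitted pipeline could also be saved as a single file, at the cost of changing what the backend expects to load.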

## 4. Model Evaluation and Visualization

In [None]:
# Create confusion matrix for the best model
best_predictions = model_results[best_model_name]['predictions']
cm = confusion_matrix(y_test, best_predictions)

# Plot confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=['Ham', 'Spam'], yticklabels=['Ham', 'Spam'])
plt.title(f'Confusion Matrix - {best_model_name}')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()

# Model comparison
model_names = list(model_results.keys())
accuracies = [model_results[name]['accuracy'] for name in model_names]

plt.figure(figsize=(10, 6))
bars = plt.bar(model_names, accuracies, color=['skyblue', 'lightgreen', 'lightcoral'])
plt.title('Model Accuracy Comparison')
plt.xlabel('Models')
plt.ylabel('Accuracy')
plt.ylim(0, 1.05)  # Full scale so low-accuracy bars stay visible and the labels fit

# Add accuracy values on bars
for bar, acc in zip(bars, accuracies):
    plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.005, 
             f'{acc:.4f}', ha='center', va='bottom')

plt.tight_layout()
plt.show()
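
Accuracy alone can hide missed spam when ham messages dominate, as they do in the full SMS Spam Collection. The worked example below uses made-up labels to show how precision and recall for the spam class expose this; `precision_score`, `recall_score`, and `f1_score` are standard scikit-learn metrics.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Made-up labels: 8 ham (0) and 2 spam (1); the model catches only one spam
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]

print(f"Accuracy:       {accuracy_score(y_true, y_pred):.2f}")   # 0.90
print(f"Spam precision: {precision_score(y_true, y_pred):.2f}")  # 1.00
print(f"Spam recall:    {recall_score(y_true, y_pred):.2f}")     # 0.50
print(f"Spam F1:        {f1_score(y_true, y_pred):.2f}")         # 0.67
```

Here 90% accuracy coexists with half of the spam getting through, which is why the classification report printed during training is worth reading alongside the accuracy bar chart.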

## 5. Save Model for Flask Backend

In [None]:
# Create models directory
models_dir = '../backend/ml_model/models'
os.makedirs(models_dir, exist_ok=True)

# Save the best model and vectorizer
model_path = os.path.join(models_dir, 'spam_model.pkl')
vectorizer_path = os.path.join(models_dir, 'vectorizer.pkl')

joblib.dump(best_model, model_path)
joblib.dump(vectorizer, vectorizer_path)

print(f"Model saved to: {model_path}")
print(f"Vectorizer saved to: {vectorizer_path}")

# Save model metadata
metadata = {
    'model_type': best_model_name,
    'accuracy': best_accuracy,
    'training_date': datetime.now().isoformat(),
    'features': X_train_tfidf.shape[1],
    'training_samples': len(X_train),
    'test_samples': len(X_test),
    'vectorizer_params': vectorizer.get_params()
}

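import json
metadata_path = os.path.join(models_dir, 'model_metadata.json')
with open(metadata_path, 'w') as f:
    # default=str covers values json cannot encode, such as the vectorizer's dtype (a NumPy type object)
    json.dump(metadata, f, indent=2, default=str)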

print(f"Metadata saved to: {metadata_path}")

## 6. Test the Saved Model

In [None]:
# Load the saved model and test it
loaded_model = joblib.load(model_path)
loaded_vectorizer = joblib.load(vectorizer_path)

# Test with sample messages
test_messages = [
    "Hi, how are you doing today?",
    "FREE! Win a £1000 cash prize! Text WIN to 12345 now!",
    "Can you pick up some milk on your way home?",
    "URGENT! Your account will be closed. Click here immediately!",
    "Thanks for the great dinner last night!"
]

print("Testing the saved model:")
print("=" * 50)

for message in test_messages:
    # Preprocess the message
    processed = preprocess_text(message)
    
    # Vectorize
    features = loaded_vectorizer.transform([processed])
    
    # Predict
    prediction = loaded_model.predict(features)[0]
    probability = loaded_model.predict_proba(features)[0]
    confidence = max(probability)
    
    label = 'SPAM' if prediction == 1 else 'HAM'
    
    print(f"Message: {message}")
    print(f"Prediction: {label} (Confidence: {confidence:.3f})")
    print("-" * 50)

## 7. Integration with Flask Backend

The model and vectorizer are now saved and ready to be used in the Flask backend. The `SpamDetector` class in the Flask app will automatically load these files and use them for predictions.

### Key Files Created:
- `spam_model.pkl`: The trained machine learning model
- `vectorizer.pkl`: The fitted TF-IDF vectorizer that turns preprocessed text into feature vectors
- `model_metadata.json`: Metadata about the model for tracking

### Usage in Flask:
The Flask backend will load these files automatically when the `SpamDetector` class is instantiated, and use them to make real-time predictions on SMS messages submitted through the API.
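
The `SpamDetector` class itself is not shown in this notebook. As a rough sketch of what such a loader might look like (the constructor argument, method names, and return format are assumptions, not the app's actual code), it needs to load both files and apply the same cleaning as `preprocess_text` before vectorizing:

```python
import os
import re

import joblib


class SpamDetector:
    """Hypothetical wrapper around the saved model and vectorizer."""

    def __init__(self, models_dir):
        # File names match those written by the save cell in this notebook
        self.model = joblib.load(os.path.join(models_dir, "spam_model.pkl"))
        self.vectorizer = joblib.load(os.path.join(models_dir, "vectorizer.pkl"))

    @staticmethod
    def _preprocess(text):
        # Mirrors the core of preprocess_text: lowercase, letters only, collapsed whitespace
        text = re.sub(r"[^a-z\s]", " ", str(text).lower())
        return " ".join(text.split())

    def predict(self, message):
        features = self.vectorizer.transform([self._preprocess(message)])
        probabilities = self.model.predict_proba(features)[0]
        # classes_ gives the label order of the predict_proba columns
        spam_index = list(self.model.classes_).index(1)
        spam_probability = float(probabilities[spam_index])
        return {
            "label": "spam" if spam_probability >= 0.5 else "ham",
            "spam_probability": spam_probability,
        }
```

Creating one instance at application startup and reusing it across requests avoids reloading the files on every prediction.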

### Model Performance:
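Markdown cells do not fill in Python variables, so the values come from the training run and are also written to `model_metadata.json`:
- **Accuracy**: `best_accuracy` (test-set accuracy of the selected model)
- **Model Type**: `best_model_name`
- **Features**: `X_train_tfidf.shape[1]` TF-IDF features

Once trained on the full SMS Spam Collection dataset rather than the sample data, the model is ready for use in your SMS Guard application.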