Question 10: Imagine you’re working as a data scientist for a company that handles email communications.

Your task is to automatically classify emails as Spam or Not Spam. The emails may contain:

● Text with diverse vocabulary

● Potential class imbalance (far more legitimate emails than spam)

● Some incomplete or missing data

Explain the approach you would take to:

● Preprocess the data (e.g. text vectorization, handling missing data)

● Choose and justify an appropriate model (SVM vs. Naïve Bayes)

● Address class imbalance

● Evaluate the performance of your solution with suitable metrics

And explain the business impact of your solution.

(Include your Python code and output in the code box below.)

  *  Answer: The approach below addresses the spam detection problem using Python and standard machine learning practice.

  1. Preprocessing the Data

(a) Text Cleaning & Tokenization:

* Lowercasing

* Removing stopwords, punctuation

* Stemming or lemmatization (optional)
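
The cleaning steps above can be sketched as a single function. This is a minimal illustration, not a production tokenizer: the `STOPWORDS` set and the `clean_text` helper are hypothetical names introduced here, and stemming or lemmatization is left out.

```python
import string

# Hypothetical, deliberately tiny stopword list; a real project would use a fuller one.
STOPWORDS = {"a", "an", "the", "is", "to", "of", "and", "in", "for", "your", "this", "with"}

def clean_text(text):
    """Lowercase, strip punctuation, and drop stopwords from one email body."""
    if not isinstance(text, str):
        # Missing values (None, NaN) become an empty string.
        return ""
    text = text.lower()
    # Remove ASCII punctuation characters.
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()
    return " ".join(token for token in tokens if token not in STOPWORDS)

cleaned = clean_text("Click HERE to claim your prize!!!")
print(cleaned)  # click here claim prize
```

Note that `TfidfVectorizer` already lowercases and can remove stopwords on its own, so a separate cleaning step like this may only be needed for extra normalization.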

(b) Vectorization:

* Use TF-IDF Vectorizer to convert text into numerical format:

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# email_texts is assumed to be an iterable of raw email strings
vectorizer = TfidfVectorizer(max_features=5000, stop_words='english')
X = vectorizer.fit_transform(email_texts)


(c) Handling Missing Data:

* Remove rows where the email record is entirely missing.

* If the text is missing but a label exists, either drop the row or impute the text as an empty string:

In [None]:
emails_df['text'] = emails_df['text'].fillna("")
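
A slightly fuller sketch of the missing-data rules, run on a hypothetical toy frame. The column names `text` and `label` follow the snippet above; dropping unlabeled rows is an added assumption, on the grounds that they cannot be used for supervised training.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: row 2 is entirely empty, row 1 lacks text, row 3 lacks a label.
emails_df = pd.DataFrame({
    "text": ["Win money now", None, None, "Meeting at 10"],
    "label": [1, 0, np.nan, np.nan],
})

# Drop rows where both text and label are missing.
emails_df = emails_df.dropna(how="all", subset=["text", "label"])
# Assumption: rows without a label are dropped, since they cannot be trained on.
emails_df = emails_df.dropna(subset=["label"])
# Keep labeled rows with no text, representing the text as an empty string.
emails_df["text"] = emails_df["text"].fillna("")
emails_df["label"] = emails_df["label"].astype(int)
emails_df = emails_df.reset_index(drop=True)

print(emails_df)
```

Two rows survive: the complete spam example and the labeled email whose text was imputed as an empty string.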


2. Model Selection: SVM vs. Naïve Bayes

(A) SVM Model:

1.   Pros

*   Effective in high-dimensional spaces
*   Works well for both linear and non-linear problems (via kernels)
*   Fairly robust to overfitting, thanks to margin maximization and regularization
*   Only the support vectors determine the decision boundary
*   Well-defined objective function

2.   Cons

*   Computationally intensive to train
*   Hard to tune
*   Kernel SVMs do not scale well to very large datasets
*   No probabilistic output by default
*   Sensitive to outliers

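(B) Naïve Bayes:

1.   Pros

*   Fast to train and predict
*   Works well with high-dimensional sparse data (like text)
*   Often performs well in practice even though the word-independence assumption rarely holds

2.   Cons

*   Assumes words are conditionally independent given the class, which is not always realistic.

Choice: Multinomial Naïve Bayes is used in the code example as a fast baseline that suits sparse TF-IDF features. A linear SVM is a reasonable alternative when its extra training and tuning cost is acceptable.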
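
One way to ground the choice between the two models is to cross-validate both on the same data. The sketch below assumes a hypothetical 12-email toy corpus and uses `LinearSVC` as the SVM variant; with so little data the scores only show the mechanics and say nothing about which model is better in general.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus: 6 spam (1) and 6 legitimate (0) emails.
texts = [
    "Win money now", "Free lottery tickets", "Click here to claim your prize",
    "Get rich fast with this simple trick", "Exclusive offer, claim your free gift",
    "Cheap loans approved instantly",
    "Important meeting today", "Project deadline is tomorrow",
    "Let's catch up over coffee", "Please review the attached report",
    "Lunch at noon on Friday?", "Minutes from yesterday's call",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# Stratified folds keep the spam / not-spam ratio the same in every fold.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

# Each pipeline refits TF-IDF inside every training fold.
candidates = {
    "MultinomialNB": make_pipeline(TfidfVectorizer(), MultinomialNB()),
    "LinearSVC": make_pipeline(TfidfVectorizer(), LinearSVC(class_weight="balanced")),
}

results = {}
for name, pipeline in candidates.items():
    scores = cross_val_score(pipeline, texts, labels, cv=cv, scoring="f1")
    results[name] = scores.mean()
    print(f"{name}: mean F1 = {scores.mean():.2f}")
```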

3. Handling Class Imbalance

* Use class weights (e.g., class_weight='balanced' in SVM)

* Or use oversampling (like SMOTE) or undersampling

* Or adjust decision thresholds based on the precision-recall tradeoff

In [None]:
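import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# labels is assumed to hold the 0/1 targets; recent scikit-learn versions expect classes as a NumPy array
class_weights = compute_class_weight(class_weight='balanced', classes=np.array([0, 1]), y=labels)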
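
To make two of these options concrete, the sketch below first reproduces the 'balanced' weights by hand, using n_samples / (n_classes * count per class), and then shows how moving the decision threshold trades precision against recall. All labels and scores are hypothetical values chosen for illustration.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels: 8 legitimate emails (0) and 2 spam emails (1).
labels = np.array([0] * 8 + [1] * 2)

# 'balanced' weights are n_samples / (n_classes * count_per_class).
manual_weights = len(labels) / (2 * np.bincount(labels))
sklearn_weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=labels
)
print(manual_weights)   # 0.625 for class 0 and 2.5 for class 1
print(sklearn_weights)  # same values

# Hypothetical true labels and predicted spam probabilities for 8 emails.
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1])
scores = np.array([0.10, 0.20, 0.30, 0.40, 0.45, 0.60, 0.70, 0.90])

threshold_results = {}
for threshold in (0.42, 0.50, 0.65):
    # Flag an email as spam when its score reaches the threshold.
    y_pred = (scores >= threshold).astype(int)
    precision = precision_score(y_true, y_pred)
    recall = recall_score(y_true, y_pred)
    threshold_results[threshold] = (precision, recall)
    print(f"threshold={threshold:.2f} precision={precision:.2f} recall={recall:.2f}")
```

On this toy data a threshold of 0.42 catches every spam email at 0.75 precision, while 0.65 raises precision to 1.00 but misses one spam email. Which end to prefer depends on how costly a blocked legitimate email is compared with a missed spam.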


4. Evaluation Metrics

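Since spam detection is a class-imbalanced binary classification task, accuracy alone is not reliable: a model that labels every email as Not Spam can still score high accuracy while catching no spam.

Use: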

* Precision: How many predicted spams were actually spam?

* Recall: How many actual spams were correctly predicted?

* F1-Score: Balance of precision and recall.

* ROC-AUC: For probability-based ranking of spam likelihood.
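
A small worked example shows how these metrics follow from the confusion matrix, and why accuracy can mislead. The labels and predictions are hypothetical: 7 legitimate emails, 3 spam, one classifier that flags four emails, and one baseline that never flags anything.

```python
from sklearn.metrics import (
    accuracy_score, confusion_matrix, f1_score, precision_score, recall_score,
)

# Hypothetical ground truth: 7 legitimate emails (0) and 3 spam emails (1).
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
# Hypothetical classifier output and a baseline that never predicts spam.
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]
y_baseline = [0] * 10

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)                          # 2 / 4 = 0.50
recall = tp / (tp + fn)                             # 2 / 3 ≈ 0.67
f1 = 2 * precision * recall / (precision + recall)  # ≈ 0.57

print("precision:", precision, precision_score(y_true, y_pred))
print("recall:   ", recall, recall_score(y_true, y_pred))
print("f1:       ", f1, f1_score(y_true, y_pred))

# Both models reach 0.7 accuracy, but the baseline catches no spam at all.
print("accuracy (classifier):", accuracy_score(y_true, y_pred))
print("accuracy (baseline):  ", accuracy_score(y_true, y_baseline))
print("recall (baseline):    ", recall_score(y_true, y_baseline))
```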

5. Business Impact

An effective spam filter:

* Protects users from phishing or malicious content.

* Reduces support costs from spam-related complaints.

* Improves user trust and satisfaction, increasing engagement and retention.

* Reduces legal risk (e.g., by supporting compliance with anti-spam laws).

In [5]:
# Python Code Example

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, roc_auc_score
import pandas as pd

# Simulated example data
data = {
    'text': [
        'Win money now!!!', 'Important meeting today', '', 'Free lottery tickets',
        'Project deadline is tomorrow', 'Click here to claim your prize',
        'Let’s catch up over coffee', 'Get rich fast with this simple trick'
    ],
    'label': [1, 0, 0, 1, 0, 1, 0, 1]  # 1 = Spam, 0 = Not Spam
}

# Load into DataFrame
df = pd.DataFrame(data)

# Fill missing text with empty strings
df['text'] = df['text'].fillna("")

# Vectorize text using TF-IDF
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(df['text'])
y = df['label']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train Naïve Bayes model
model = MultinomialNB()
model.fit(X_train, y_train)

# Predict and evaluate
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]

# Output evaluation metrics
print("Classification Report:")
print(classification_report(y_test, y_pred))

print("ROC-AUC Score:", roc_auc_score(y_test, y_prob))


Classification Report:
              precision    recall  f1-score   support

           0       0.33      1.00      0.50         1
           1       0.00      0.00      0.00         2

    accuracy                           0.33         3
   macro avg       0.17      0.50      0.25         3
weighted avg       0.11      0.33      0.17         3

ROC-AUC Score: 0.75
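
With only three test emails, the scores above are too noisy to draw conclusions from. The example also fits the vectorizer on all emails before splitting, so test vocabulary leaks into training. A possible variant, sketched below on a hypothetical 12-email toy set, wraps TF-IDF and the classifier in a `Pipeline` so the vectorizer is fitted on training data only, and uses a stratified split so both classes appear in the test set.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: 6 spam (1) and 6 legitimate (0) emails.
texts = [
    "Win money now", "Free lottery tickets", "Click here to claim your prize",
    "Get rich fast with this simple trick", "Exclusive offer, claim your free gift",
    "Cheap loans approved instantly",
    "Important meeting today", "Project deadline is tomorrow",
    "Let's catch up over coffee", "Please review the attached report",
    "Lunch at noon on Friday?", "Minutes from yesterday's call",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# Stratify so the 4 held-out emails contain both classes.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=4, stratify=labels, random_state=42
)

# The pipeline fits TF-IDF on the training emails only, then trains Naive Bayes.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("nb", MultinomialNB()),
])
pipeline.fit(X_train, y_train)

y_pred = pipeline.predict(X_test)
y_prob = pipeline.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, y_prob)

# zero_division=0 reports 0 without a warning when a class is never predicted.
print(classification_report(y_test, y_pred, zero_division=0))
print("ROC-AUC Score:", auc)
```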


