# Spam Detection Workshop
## Text Classification for SMS and Email Messages

**Learning Objectives:**
By completing this workshop, you will be able to:
- Understand fundamental concepts of Natural Language Processing (NLP)
- Handle text preprocessing and tokenization strategies
- Apply feature extraction techniques for text classification
- Implement and evaluate machine learning models for text data
- Compare model performance across different datasets and scenarios
- Understand the challenges of domain transfer in text classification

**Context:**
Spam detection is a critical application of text classification that helps protect users from unwanted messages. This workshop uses two different datasets (SMS and email messages) to explore how machine learning models perform across different text domains and communication channels.

Unlike numerical data, text requires special preprocessing steps including tokenization, feature extraction, and encoding before machine learning algorithms can process it effectively.


***
# 1. Library Setup and Data Loading

Let's start by importing the necessary libraries for text processing and machine learning.

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import string
import re
import numpy as np
from sklearn.linear_model import LogisticRegression
from matplotlib.pyplot import plot, show

RANDOM_STATE = 3
TRAIN_TEST_SPLIT_SIZE = 0.2

***
# 2. Read the input data and check its sanity
We have two annotated corpora:
* SMS messages and their classes;
* email messages and their classes.

Understanding the characteristics and quality of our datasets is essential before building models. We need to check for duplicate entries, data imbalance, and basic statistics to ensure robust model training.

## 2.1 Initial Data Loading and Exploration

In [None]:
# Load the datasets
sms_data = pd.read_csv('sms_spam.csv', sep=';')
email_data = pd.read_csv('email_spam.csv')

**Exercise:** Check for duplicate entries and data quality issues

Duplicate entries can artificially inflate performance if the same message appears in both training and test sets. Removing duplicates ensures fair evaluation and prevents data leakage.

**Documentation references:**
- [pandas.DataFrame.drop_duplicates()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html)
- [Data quality assessment](https://pandas.pydata.org/docs/user_guide/duplicates.html)

Use the `drop_duplicates()` method to remove any duplicate entries from both datasets. Set `inplace=True` and `ignore_index=True` to modify the datasets directly and reset row indices.
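
As a minimal sketch of what these two arguments do, the toy DataFrame below contains one repeated row. The data is invented for illustration; only the `message` and `label` column names are taken from the later cells of this workshop.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "message": ["win cash now", "see you at 5", "win cash now"],
    "label": [1, 0, 1],
})

# duplicated() marks every row that is identical to an earlier row
n_duplicates = int(toy.duplicated().sum())
print("Duplicates found:", n_duplicates)

# inplace=True modifies toy directly; ignore_index=True renumbers rows 0..n-1
toy.drop_duplicates(inplace=True, ignore_index=True)
print(toy)
```

Counting duplicates before dropping them is a cheap way to report how much of each corpus was redundant.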

In [None]:
# TODO Check for the existence of duplicate entries and eliminate them if necessary.


In [None]:
# Extract messages and labels for easier handling
sms_messages = sms_data["message"]
sms_labels = sms_data["label"]
email_messages = email_data["message"]
email_labels = email_data["label"]

## 2.2 Dataset Splitting Strategies

Different experimental scenarios require different data splitting approaches. We'll implement four strategies to explore various aspects of text classification performance:

1. **Train/Test on SMS**: Standard evaluation within SMS domain
2. **Train/Test on Email**: Standard evaluation within email domain  
3. **Cross-Domain Transfer**: Train on SMS, test on email (evaluation under domain shift)
4. **Combined Training**: Train and test on merged datasets

**Exercise:** Implement SMS dataset splitting function

Standard train-test splitting allows us to evaluate model performance within the SMS domain. This provides a baseline for comparison with other scenarios.

**Documentation references:**
- [train_test_split documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)
- [Random state for reproducibility](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html#sklearn.model_selection.train_test_split)
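
The sketch below illustrates a stratified split on invented data; it is not the solution cell itself. The `stratify` argument of `train_test_split` keeps the ham/spam ratio the same in both parts, and the 50/50 split size is chosen only so that the toy numbers divide evenly.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data, invented for illustration: 8 ham (0) and 2 spam (1)
messages = pd.Series([f"message {i}" for i in range(10)])
labels = pd.Series([0] * 8 + [1] * 2)

train_msgs, test_msgs, train_lbls, test_lbls = train_test_split(
    messages,
    labels,
    test_size=0.5,     # the workshop itself uses TRAIN_TEST_SPLIT_SIZE
    random_state=3,    # same value as RANDOM_STATE, for reproducibility
    stratify=labels,   # keep the ham/spam ratio in both halves
)
print(train_lbls.value_counts().to_dict())
print(test_lbls.value_counts().to_dict())
```

Without `stratify`, a small or imbalanced dataset can end up with very few spam messages in one of the two parts.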

In [None]:
# TODO Implement the function
def train_eval_sms():
    """
    Split SMS dataset into training and testing sets.
    
    Creates a standard train-test split for SMS spam detection evaluation.
    Uses stratified sampling to maintain class distribution across splits.
    
    Returns
    -------
    tuple
        Tuple containing (train_messages, test_messages, train_labels, test_labels)
        
    Notes
    -----
    Uses global TRAIN_TEST_SPLIT_SIZE and RANDOM_STATE for consistency
    across all experiments.
    """


**Exercise:** Implement email dataset splitting function

Similar to SMS splitting, this function enables evaluation within the email domain to establish baseline performance for email spam detection.

In [None]:
# TODO: Implement the function
def train_eval_email():
    """
    Split email dataset into training and testing sets.
    
    Creates a standard train-test split for email spam detection evaluation.
    Maintains consistent splitting parameters with SMS evaluation for fair comparison.
    
    Returns
    -------
    tuple
        Tuple containing (train_messages, test_messages, train_labels, test_labels)
    """


**Exercise:** Implement cross-domain transfer function

Domain transfer testing reveals how well models trained on one type of text (SMS) perform on another (email). This scenario is common in real-world applications where training data and deployment contexts differ.

In [None]:
# TODO: Implement the train_sms_eval_email function to train on SMS data and evaluate on email data
def train_sms_eval_email():
    """
    Prepare data for cross-domain transfer learning experiment.
    
    Uses entire SMS dataset for training and entire email dataset for testing.
    This setup evaluates model generalization across different text domains
    and communication channels.
    
    Returns
    -------
    tuple
        Tuple containing (train_messages, test_messages, train_labels, test_labels)
        where training data comes from SMS and testing data from email
        
    Notes
    -----
    No random splitting is performed as we use complete datasets for 
    cross-domain evaluation.
    """


**Exercise:** Implement combined dataset function

Training on combined data tests whether mixing domains improves overall performance and provides insights into dataset complementarity for spam detection.

**Documentation references:**
- [pandas.concat() for combining datasets](https://pandas.pydata.org/docs/reference/api/pandas.concat.html)
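
A minimal sketch of merging two corpora with `pandas.concat`, using invented stand-ins for the SMS and email data (the names `sms_toy` and `email_toy` are hypothetical):

```python
import pandas as pd

# Toy stand-ins for the two corpora (invented data)
sms_toy = pd.DataFrame({"message": ["free entry txt now", "ok c u later"], "label": [1, 0]})
email_toy = pd.DataFrame({"message": ["Meeting agenda attached"], "label": [0]})

# ignore_index=True gives the merged frame a fresh 0..n-1 index;
# without it, the row label 0 would appear twice.
combined = pd.concat([sms_toy, email_toy], ignore_index=True)
print(combined)
```

The merged frame can then be split in the same way as a single-domain dataset.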

In [None]:
# TODO: Implement the train_eval_combined function to combine SMS and email data for training and testing
def train_eval_combined():
    """
    Combine SMS and email datasets for unified training and testing.
    
    Merges both datasets and creates a mixed train-test split. This approach
    evaluates whether combining different text domains improves overall
    spam detection performance.
    
    Returns
    -------
    tuple
        Tuple containing (train_messages, test_messages, train_labels, test_labels)
        from the combined dataset
        
    Notes
    -----
    Uses pandas.concat to merge datasets while preserving all data points.
    Maintains class balance across the combined dataset.
    """


In [None]:
# Initialize with combined dataset for demonstration
training_messages, testing_messages, training_labels, testing_labels = train_eval_combined()

***
# 3. Data Balancing and Class Distribution

Class imbalance is a common problem in spam detection where spam messages are typically much less frequent than legitimate messages. Imbalanced datasets can lead to biased models that perform poorly on minority classes. We need to address this issue to ensure fair evaluation and robust model performance.

## 3.1 Class Imbalance Detection and Correction

Balanced training data ensures that models learn both spam and non-spam patterns equally well. Oversampling the minority class is a simple and effective approach for text classification.

**Documentation references:**
- [pandas.DataFrame.sample()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sample.html)
- [Class imbalance handling techniques](https://scikit-learn.org/stable/modules/generated/sklearn.utils.resample.html)
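
For comparison with the `balance` function defined in the next cell, here is a sketch of the same idea using `sklearn.utils.resample` on invented data. It assumes labels coded as 0 (ham) and 1 (spam) and a minority class of spam; neither is guaranteed for the real datasets.

```python
import pandas as pd
from sklearn.utils import resample

# Toy training set, invented for illustration: 4 ham, 2 spam
toy = pd.DataFrame({
    "message": ["hi", "lunch?", "call me", "on my way", "win a prize", "free cash"],
    "label": [0, 0, 0, 0, 1, 1],
})
majority = toy[toy["label"] == 0]
minority = toy[toy["label"] == 1]

# Draw the missing number of minority rows, with replacement
extra = resample(
    minority,
    replace=True,
    n_samples=len(majority) - len(minority),
    random_state=3,
)
balanced = pd.concat([toy, extra], ignore_index=True)
print(balanced["label"].value_counts().to_dict())
```

Only the training data should be oversampled; balancing the test set would make the evaluation unrepresentative of real message traffic.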

In [None]:
def balance(training_messages, training_labels):
    """
    Balance training data by oversampling the minority class.
    
    Addresses class imbalance by randomly sampling additional instances
    from the underrepresented class until both classes have equal frequency.
    This prevents model bias toward the majority class.
    
    Parameters
    ----------
    training_messages : pandas.Series
        Training text messages
    training_labels : pandas.Series  
        Corresponding class labels (0 for ham, 1 for spam)
        
    Returns
    -------
    tuple
        Tuple containing (balanced_messages, balanced_labels) with equal
        class representation
        
    Notes
    -----
    Uses random sampling with replacement to increase minority class size.
    Original rows are kept; only duplicated minority-class rows are appended.
    """
    print("Label counts before balancing:")
    print(training_labels.value_counts())
    
    counts = training_labels.value_counts()
    if counts[1] > counts[0]:
        label_to_oversample = 0
        diff = counts[1] - counts[0]
    else:
        label_to_oversample = 1
        diff = counts[0] - counts[1]
    
    training_data = pd.concat([training_messages, training_labels], axis=1)
    draw_from = training_data[training_data["label"] == label_to_oversample]
    
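    # Draw all extra rows in one call, with replacement. Sampling one row at a
    # time with a fixed random_state would return the same row on every draw.
    extra = draw_from.sample(n=diff, replace=True, random_state=RANDOM_STATE)
    training_data = pd.concat([training_data, extra])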
    
    training_messages = training_data["message"]
    training_labels = training_data["label"]
    
    print("Label counts after balancing:")
    print(training_labels.value_counts())
    return training_messages, training_labels

**Exercise:** Apply balancing to training data

Check if the current training data is balanced and apply correction if needed. Balanced training data is crucial for fair model evaluation and optimal performance on both classes.

In [None]:
# TODO: Check if our training data is balanced and apply balancing if necessary


***
# 4. Text Preprocessing and Feature Extraction

## 4.1 Understanding Tokenization Strategies

We will take the simplest approach possible: our input features will be the most frequent words and their frequencies. The idea is that the words that appear frequently in a document are characteristic of its content, and thus of its spam-ness.

The **CountVectorizer** class of scikit-learn will do exactly this for us. It first counts word frequencies across *all* messages, in order to find the overall most frequent ones. Then, it counts the occurrences of these most frequent words in each message, computing a frequency vector per message, where each dimension of the vector corresponds to a frequent word.

Firstly, what does *most frequent word* mean? We will define a threshold, which we will call the **number of features**.

Secondly, what is a word? *Word* is not a term from linguistics; it has no scientific definition.
* Is "hazelnuts" one word, two words, or three words? "hazel", "nut", and "-s" are what linguists call *morphemes*: the elementary units of meaning, but in common language "hazelnuts" would be considered as a single word.
* Is "$12.50" a single word? It consists of a currency symbol and a rational number.
* Is "Joe's" one or two words?
* etc.

A pragmatic choice is not to use the term "word" but rather the term "token". A token can be whatever unit into which we decide to split our input text. We call **tokenization** the process of splitting a text into tokens. **Beware: the choice of splitting rule will determine the performance of downstream tasks.** CountVectorizer has a **token_pattern** parameter that takes a regular expression as an input string. Instead of blindly trusting the default tokenization that CountVectorizer offers, let us define our own rule. A few possibilities:
* split by whitespace;
* split by whitespace or punctuation;
* keep only tokens that contain letters or digits;
* keep only tokens of length > X (where you choose X);
* etc.
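
To make the effect of the splitting rule concrete, the sketch below applies three illustrative patterns to one invented message using `re.findall`. The patterns are examples, not recommendations.

```python
import re

text = "WIN $12.50 now!! Call Joe's mobile"  # invented example message

# Each pattern is an illustrative choice, not a recommendation
patterns = {
    "whitespace-delimited": r"\S+",
    "letters or digits only": r"[A-Za-z0-9]+",
    "at least 3 letters/digits": r"[A-Za-z0-9]{3,}",
}
tokens = {name: re.findall(pattern, text) for name, pattern in patterns.items()}
for name, found in tokens.items():
    print(f"{name:26} {found}")
```

Note how "$12.50" survives as one token under the first rule, is split into "12" and "50" under the second, and disappears entirely under the third.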

Furthermore, it is common to perform additional preprocessing on the input text, always depending on the requirements of the downstream task:
* Converting to all-lowercase may improve results in some cases (e.g. "WIN" and "win" are collapsed into a single feature), but it may also lose useful information (is all-caps characteristic of spam?).
* In bag-of-words models, removing so-called *stop words* helps eliminate frequent grammatical words that bear little relevant meaning (e.g. articles, pronouns, modal verbs, prepositions). Beware: models that rely on syntax (i.e. phrases) do need grammatical words, so stop words should not be eliminated systematically.
* The presence of numbers (e.g. phone numbers, money amounts) in the text may be an important feature to detect spam. However, CountVectorizer treats each distinct number as a different token, and since phone numbers (sums of money, etc.) tend not to repeat across messages, the classifier will not be able to generalise over them. You may implement generalisation manually by detecting tokens that only contain digits and replacing them with a generic "<NUM>" token, for example.

CountVectorizer is a powerful tool that has built-in support for both lowercase conversion and removal of English stop words, so you do not need to implement these preprocessing operations by hand. 

**Documentation references:**
- [Text feature extraction guide](https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction)
- [CountVectorizer documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)
- [Regular expressions for tokenization and for finding numbers](https://docs.python.org/3/library/re.html#regular-expression-syntax)

**Preprocess numbers appearing in messages**

Replace numbers such as "5000" or "0612345678" with a generic "<NUM>" token, helping the classifier generalise over such tokens. You can use the `re.sub` function to find patterns and replace them in strings.
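
A minimal sketch of the `re.sub` call on an invented message. The helper name `replace_numbers` is hypothetical, and the pattern `\d+` (any run of digits) is only one possible definition of "number".

```python
import re

def replace_numbers(text):
    """Replace every run of digits with the placeholder <NUM> (illustrative helper)."""
    return re.sub(r"\d+", "<NUM>", text)

print(replace_numbers("Call 0612345678 to claim 5000 euros"))
```

With this simple pattern, an amount such as "12.50" becomes "<NUM>.<NUM>". Also keep in mind that the tokenization pattern chosen later decides whether the angle brackets of the placeholder are kept or stripped.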

In [None]:
# TODO: implement a preprocessing method that replaces numbers by a generic placeholder, and possibly other simplifications
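def preprocess(message):
    """
    Return the message with numbers replaced by a generic "<NUM>" placeholder.

    Further simplifications (e.g. lowercasing) may be added here.
    """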


**Configure tokenization and feature extraction parameters**

The `token_pattern` parameter controls how text is split into tokens. Different patterns can significantly impact model performance by determining which linguistic units are considered as features.

**Exercise:** Experiment with different regular expressions for tokenization and with different numbers of features. For example, the regex `r"([A-Za-z0-9][A-Za-z0-9]+)"` extracts only tokens that are at least two characters long and consist solely of digits and unaccented letters (a-z, A-Z).

In [None]:
# Define the regular expression that extracts tokens.
# The regex may contain at most one capturing (parenthesised) group; if present,
# the content of that group, rather than the whole match, becomes the token.

# The following example extracts series of non-whitespace characters.
TOKEN_REGEX = r"(\S+)"
NB_FEATURES = 5000

**Exercise:** Initialize text vectorization with CountVectorizer

CountVectorizer converts text documents into numerical feature vectors by counting token occurrences. It first builds a vocabulary from the most frequent tokens, then represents each document as a vector of token counts.

**Key parameters:**
- `max_features`: Limits vocabulary size to most frequent tokens
- `token_pattern`: Regular expression defining what constitutes a token
- `lowercase`: Whether to convert text to lowercase before tokenization
- `stop_words`: Which stop words to remove (`"english"` for the built-in English list, a custom list, or `None` to keep all tokens)
- `preprocessor`: calls your custom text preprocessor function given as argument (warning: it overrides the `lowercase` setting!)

Initialize CountVectorizer with the defined parameters to prepare for feature extraction.

In [None]:
# TODO Call CountVectorizer with the parameters you wish to use


## 4.2 Feature Matrix Creation

The vectorization process has two phases:
- **Fit**: Analyzes training text to build vocabulary of most frequent tokens
- **Transform**: Converts text documents into numerical feature vectors using the learned vocabulary

**Documentation references:**
- [Fit vs Transform in scikit-learn](https://scikit-learn.org/stable/modules/preprocessing.html#preprocessing)
- [Feature extraction workflow](https://scikit-learn.org/stable/modules/feature_extraction.html#the-bag-of-words-representation)

**Exercise:** Create feature matrices for training and testing

Use `fit_transform()` on training data to learn vocabulary and create features simultaneously:
- "fit" computes the *vocabulary* consisting of the most frequent tokens.
- "transform" computes the *frequencies* of tokens in the vocabulary, which will be our input features. 

Use `transform()` on testing data to convert it using the same vocabulary learned from training, ensuring consistency between training and testing representations.

You can use `vectorizer.get_feature_names_out()` to obtain the list of features (= most frequent tokens) extracted.

In [None]:
# TODO Fit and transform with the vectorizer on the training messages


In [None]:
# TODO Do transform with the vectorizer on the testing messages


***
# 5. Model Training with Logistic Regression

We will use one of the simplest and fastest machine learning models that exist: a **logistic regression classifier**.

Logistic regression is a binary classifier, which suits our task well. The only hyperparameter we will set is the maximum number of training iterations (`max_iter` in scikit-learn).

We could also use other classifiers, such as an SVM, but the goal of this lab is to get familiar with a few fundamental notions of natural language processing, not to find the best machine learning method.

## 5.1 Model Configuration and Training

**Documentation references:**
- [Logistic Regression documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)
- [Text classification with scikit-learn](https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html)

In [None]:
# Model hyperparameters
NB_ITERATIONS = 1000

**Exercise:** Initialize and train the logistic regression model

Configure Logistic Regression with sufficient iterations to ensure convergence on high-dimensional text features. Train the model on the preprocessed feature matrix to learn patterns distinguishing spam from legitimate messages.
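
As a rough sketch of the whole chain on invented data, the example below vectorizes four toy messages, fits a `LogisticRegression` with `max_iter` set, and classifies one new message. The names `toy_vectorizer` and `toy_model` are hypothetical, and four messages are of course far too few for a meaningful model.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy data, invented for illustration
texts = ["free prize now", "win free cash", "lunch at noon", "see you at home"]
labels = [1, 1, 0, 0]  # 1 = spam, 0 = ham

toy_vectorizer = CountVectorizer()
X = toy_vectorizer.fit_transform(texts)

# max_iter caps the number of solver iterations
toy_model = LogisticRegression(max_iter=1000, random_state=3)
toy_model.fit(X, labels)

new_message = toy_vectorizer.transform(["free cash prize"])
prediction = toy_model.predict(new_message)
print(prediction)
print(toy_model.predict_proba(new_message))  # columns follow toy_model.classes_
```

If scikit-learn emits a convergence warning on the real data, raising the iteration limit is the first thing to try.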

In [None]:
# TODO Instantiate a logistic regression model with the number of iterations as an input hyperparameter


In [None]:
# TODO Train the model


***
# 6. Model Evaluation and Metrics

Evaluation metrics provide different perspectives on model performance. For spam detection, we need to understand not just overall accuracy but also how many of the messages flagged as spam really are spam (precision) and how many of the actual spam messages the model catches (recall).

**Confusion Matrix Concepts:**

|                | Predicted Ham | Predicted Spam |
|----------------|:------------:|:-------------:|
| **Actual Ham** | TN           | FP            |
| **Actual Spam**| FN           | TP            |

Where:
- **True Positives (TP)**: Correctly identified spam
- **True Negatives (TN)**: Correctly identified ham  
- **False Positives (FP)**: Ham incorrectly labeled as spam
- **False Negatives (FN)**: Spam incorrectly labeled as ham
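
The four cells of the table can be counted directly by comparing truth and prediction pairwise. The sketch below does this in plain Python on invented labels, assuming 1 denotes spam and 0 denotes ham.

```python
truth = [1, 0, 1, 1, 0, 0, 1, 0]  # toy ground truth: 1 = spam, 0 = ham
pred = [1, 0, 0, 1, 1, 0, 1, 0]   # toy predictions

pairs = list(zip(truth, pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # spam predicted as spam
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # ham predicted as ham
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # ham predicted as spam
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # spam predicted as ham
print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
```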

**Documentation references:**
- [Classification metrics guide](https://scikit-learn.org/stable/modules/model_evaluation.html#classification-metrics)
- [Confusion matrix interpretation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html)

## 6.1 Metric Implementations

### 6.1.1 Accuracy Implementation

**Exercise:** Implement accuracy calculation from scratch

Accuracy measures the proportion of correct predictions (both spam and ham) out of all predictions. While intuitive, accuracy can be misleading with imbalanced datasets where a model could achieve high accuracy by always predicting the majority class.

**Formula:** Accuracy = (TP + TN) / (TP + TN + FP + FN)
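
To see why accuracy alone can mislead, consider an invented test set of 95 ham and 5 spam messages and a degenerate "model" that always predicts ham. This is a sketch under those assumed numbers, not a result from the workshop datasets.

```python
import numpy as np

# Toy imbalanced test set: 95 ham (0) and 5 spam (1)
truth = np.array([0] * 95 + [1] * 5)
always_ham = np.zeros_like(truth)  # a "model" that never predicts spam

accuracy = float(np.mean(truth == always_ham))
spam_caught = int(np.sum((truth == 1) & (always_ham == 1)))
print(f"accuracy={accuracy:.2f}, spam caught={spam_caught} of 5")
```

The accuracy looks high even though not a single spam message is detected.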

In [None]:
# TODO: Compute accuracy by comparing truth and predicted labels
def accuracy_score(truth, pred):
    """
    Calculate accuracy as the proportion of correct predictions.
    
    Accuracy measures overall correctness but may not reflect performance
    on individual classes, especially with imbalanced datasets.
    
    Parameters
    ----------
    truth : array-like
        Ground truth labels (0 for ham, 1 for spam)
    pred : array-like  
        Predicted labels from the model
        
    Returns
    -------
    float
        Accuracy score between 0 and 1, where 1 indicates perfect accuracy
        
    Notes
    -----
    Accuracy alone may be misleading for imbalanced datasets where
    a model could achieve high accuracy by always predicting the majority class.
    """


### 6.1.2 Precision Implementation

**Exercise:** Implement precision calculation from scratch

Precision measures the proportion of predicted spam that is actually spam. High precision means few false alarms (legitimate messages incorrectly flagged as spam), which is crucial for user experience.

**Formula:** Precision = TP / (TP + FP)

**Documentation references:**
- [Precision definition](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html)

In [None]:
# TODO: Compute precision by calculating true positives and false positives
def precision_score(truth, pred, pos_label):
    """
    Calculate precision for the positive class (spam).
    
    Precision measures the proportion of predicted spam messages that are
    actually spam. High precision indicates few false positive errors
    (legitimate messages incorrectly classified as spam).
    
    Parameters
    ----------
    truth : array-like
        Ground truth labels
    pred : array-like
        Predicted labels from the model  
    pos_label : int or str
        Label that represents the positive class (spam)
        
    Returns
    -------
    float
        Precision score between 0 and 1, where 1 indicates perfect precision
        
    Notes
    -----
    Precision is especially important in spam detection to minimize
    false positives that could cause users to miss important messages.
    """


### 6.1.3 Recall Implementation

**Exercise:** Implement recall calculation from scratch

Recall measures the proportion of actual spam that the model correctly identifies. High recall means the model catches most spam messages, which is important for protecting users from unwanted content.

**Formula:** Recall = TP / (TP + FN)
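
The sketch below works through the precision and recall formulas on invented labels. String labels are used to show what the `pos_label` argument is for; the zero-denominator guards are an added assumption about how to handle the case where no message is predicted or labeled as spam.

```python
# Toy labels as strings, to show why a pos_label argument is useful
truth = ["spam", "ham", "spam", "spam", "ham", "ham"]
pred = ["spam", "spam", "ham", "spam", "ham", "ham"]
pos_label = "spam"

tp = sum(t == pos_label and p == pos_label for t, p in zip(truth, pred))
fp = sum(t != pos_label and p == pos_label for t, p in zip(truth, pred))
fn = sum(t == pos_label and p != pos_label for t, p in zip(truth, pred))

# Guard against division by zero when nothing is predicted or labeled positive
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
print(f"precision={precision:.3f}, recall={recall:.3f}")
```

Here TP = 2, FP = 1, and FN = 1, so both metrics equal 2/3.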

**Documentation references:**
- [Recall definition](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html)

In [None]:
# TODO: Compute recall by calculating true positives and false negatives
def recall_score(truth, pred, pos_label):
    """
    Calculate recall for the positive class (spam).
    
    Recall measures the proportion of actual spam messages that the model
    correctly identifies. High recall indicates the model catches most
    spam with few false negative errors.
    
    Parameters
    ----------
    truth : array-like
        Ground truth labels
    pred : array-like
        Predicted labels from the model
    pos_label : int or str  
        Label that represents the positive class (spam)
        
    Returns
    -------
    float
        Recall score between 0 and 1, where 1 indicates perfect recall
        
    Notes
    -----
    Recall is crucial in spam detection to ensure most unwanted messages
    are filtered out, protecting users from spam content.
    """


## 6.2 Model Prediction and Performance Analysis

**Exercise:** Generate predictions and calculate comprehensive metrics

Use the trained model to make predictions on the test set, then calculate all three metrics to get a complete picture of model performance. Compare these metrics to understand the trade-offs between accuracy, precision, and recall.

In [None]:
# TODO Generate predictions using the trained model


In [None]:
# TODO Calculate all performance metrics


In [None]:
print("Accuracy : " + str(acc))
print("Precision: " + str(prec))
print("Recall   : " + str(rec))


***
# 7. Experimental Scenarios and Comparative Analysis

The following exercises guide you through different experimental scenarios to understand how text classification models perform across domains and datasets. Each scenario reveals different aspects of model generalization and domain adaptation.


## 7.1 Single-Domain Experiments

**Exercise:** Run experiments on SMS dataset only

Modify the data loading section to use `train_eval_sms()` instead of the combined dataset. Observe how the model performs when trained and tested on the same text domain (SMS messages).

**Questions to consider:**
- How does performance compare to the combined dataset results?
- Which metrics show the most significant changes?
- What might explain any performance differences?


**Exercise:** Run experiments on email dataset only

Switch to using `train_eval_email()` to train and test exclusively on email data. Compare results with the SMS-only experiment.

**Questions to consider:**
- Do emails and SMS messages show similar classification difficulty?
- Which dataset appears more challenging for spam detection?
- How do the optimal features differ between domains?


## 7.2 Cross-Domain Transfer Learning

**Exercise:** Train on SMS, evaluate on email dataset

Use `train_sms_eval_email()` to explore domain transfer performance. This simulates a realistic scenario where you have labeled data from one domain but need to deploy in another.

**Questions to consider:**
- How much does performance degrade when transferring across domains?
- Which metrics are most affected by domain mismatch?
- What linguistic differences between SMS and email might explain the results?


## 7.3 Combined Dataset Analysis

**Exercise:** Train and evaluate on combined datasets

Return to using `train_eval_combined()` to assess whether mixing domains during training improves overall robustness.

**Questions to consider:**
- Does combined training improve generalization across both domains?
- How do results compare to single-domain experiments?
- What are the trade-offs of mixed-domain training?

## 7.4 Add README file

**Exercise:** Create a professional README file documenting your spam detection analysis

Based on your experimental results across all scenarios (SMS-only, email-only, cross-domain transfer, and combined datasets), create a comprehensive README.md file that summarizes your key findings, methodology, and performance comparisons. Include quantitative results, optimal preprocessing configurations, and practical deployment recommendations, following the approach you learned previously.

