# EE 467 Lab 1: ML Pipeline for Spam Detection

In this lab, we will walk through a typical machine learning pipeline and apply it to a cybersecurity problem: building a binary classifier that detects spam emails. As in the previous lab, some code is left out for you to complete. Refer to the API references and previous labs, search online for usage of libraries and functions, and ask the TA or instructor if you get stuck.

Before working on the code, we will need to install `NLTK` and `scikit-learn` for this lab:

In [6]:
%pip install nltk scikit-learn



And ensure the dataset is extracted from the archive:

In [7]:
# Extract data
!tar -xf emails.tar.xz

tar: Ignoring unknown extended header keyword 'LIBARCHIVE.xattr.com.apple.quarantine'
tar: Ignoring unknown extended header keyword 'LIBARCHIVE.xattr.com.apple.macl'


Then import the libraries we will use here:

In [2]:
# =============================================================================
# IMPORT REQUIRED LIBRARIES
# =============================================================================
# string   - Python's built-in module for string operations (punctuation list)
# numpy    - Numerical computing (we use 'np' as the standard alias)
# pandas   - Data manipulation and analysis (we use 'pd' as the standard alias)
# nltk     - Natural Language Toolkit for text processing
# =============================================================================

import string

import numpy as np
import pandas as pd

# NLTK (Natural Language Toolkit) - the most popular Python library for NLP
import nltk
from nltk.corpus import stopwords

# Download stop words (common words like "the", "a", "is" that add no meaning)
# These need to be downloaded once before use
nltk.download("stopwords")

[nltk_data] Downloading package stopwords to /home/main/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

## Pre-processing

All machine learning tasks begin with the **pre-processing** step, during which we load the dataset into memory and "clean" the data so that it is suitable for subsequent steps. For the spam email detection task, we will load all emails into memory, tokenize each email into a list of words, and then remove words that are useless for analysis.

All emails are stored in `emails.csv` under the same directory as this notebook. Feel free to open the file, take a look and get familiar with the format of the email dataset, then go back here to load the data.

In [3]:
# =============================================================================
# LOADING THE DATASET
# =============================================================================
# pd.read_csv() reads a CSV file and returns a DataFrame
# A DataFrame is like a spreadsheet - rows are samples, columns are features
# =============================================================================

# Load email dataset into a DataFrame
df = pd.read_csv("emails.csv")

# Preview first 5 rows
print(df.head(5), "\n")

                                                text  spam
0  Subject: naturally irresistible your corporate...     1
1  Subject: the stock trading gunslinger  fanny i...     1
2  Subject: unbelievable new homes made easy  im ...     1
3  Subject: 4 color printing special  request add...     1
4  Subject: do not have money , get software cds ...     1 



In [4]:
# Check dataset size and columns
print("Shape:", df.shape)      # (rows, columns)
print("Columns:", df.columns)  # 'text' = email, 'spam' = label (1=spam, 0=ham)

Shape: (5728, 2)
Columns: Index(['text', 'spam'], dtype='object')


In [5]:
df.drop_duplicates(inplace=True)

In [6]:
# Number of missing (NAN, NaN, na) data for each column
df.isnull().sum()

text    0
spam    0
dtype: int64

After loading the email dataset into memory, we need to remove punctuation and stop words from these emails. Stop words are common words that carry little information for classification and should be ignored in the analysis (such as *a*, *an*, *the*, ...).

In [7]:
# Text tokenizer: removes punctuation and stop words

# Build the stop-word set once; calling stopwords.words() for every word is slow
STOP_WORDS = set(stopwords.words('english'))

def process_text(text):
    """Convert email text to list of meaningful words."""

    # Remove punctuation (!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~)
    nopunc = [char for char in text if char not in string.punctuation]
    nopunc = ''.join(nopunc)

    # Remove stop words ("the", "a", "is", etc.) - case insensitive
    clean_words = [word for word in nopunc.split()
                   if word.lower() not in STOP_WORDS]

    return clean_words

In [8]:
# Preview the result of tokenization
df['text'].head().apply(process_text)

0    [Subject, naturally, irresistible, corporate, ...
1    [Subject, stock, trading, gunslinger, fanny, m...
2    [Subject, unbelievable, new, homes, made, easy...
3    [Subject, 4, color, printing, special, request...
4    [Subject, money, get, software, cds, software,...
Name: text, dtype: object
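To make the two steps inside `process_text` concrete without depending on the NLTK download, here is a minimal sketch that uses a tiny hand-written stop-word set. The names `TOY_STOP_WORDS` and `toy_process_text` are illustrative only and are not part of the lab code; the real NLTK list is much longer.

```python
import string

# Hypothetical, deliberately tiny stop-word set (assumption for illustration;
# the lab itself uses NLTK's full English list).
TOY_STOP_WORDS = {"a", "an", "the", "is", "your", "do", "not", "have"}

def toy_process_text(text):
    """Strip punctuation, then drop stop words (case-insensitive)."""
    # str.translate deletes every punctuation character in one pass
    no_punct = text.translate(str.maketrans("", "", string.punctuation))
    return [w for w in no_punct.split() if w.lower() not in TOY_STOP_WORDS]

tokens = toy_process_text("Subject: do not have money , get software cds !")
print(tokens)
```

With this toy list, the sentence reduces to the same leading tokens as row 4 of the preview above (`Subject`, `money`, `get`, `software`, `cds`).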

## Feature Extraction

We have obtained semi-structured tokenized email texts in the pre-processing step; however, machine learning algorithms usually operate on fully-structured numerical features. Hence, we need to find a way to convert the email texts to numeric vectors. This process is called **feature extraction**, and is necessary in data mining and analysis tasks where input data is semi-structured or even unstructured. In the following part we will make use of `scikit-learn`, which is a library for classic machine learning and feature extraction.

We will use **token count features** to represent the characteristics of each email. This turns a piece of text into a vector, each dimension of which contains the number of occurrences of a particular word. In practice, we process many texts at once and end up with a token count matrix. Below is a simple demo on a toy dataset with only two emails:

In [9]:
# DEMO: Bag-of-Words converts text → word count vectors

message4 = 'hello world hello hello world play'
message5 = 'test test test test one hello'

from sklearn.feature_extraction.text import CountVectorizer

# CountVectorizer: text → matrix where each column = a word, values = counts
cv = CountVectorizer(analyzer=process_text)
# Pass a flat list of strings: one string per document
bow4 = cv.fit_transform([message4, message5])

In [10]:
# Vocabulary = unique words (these become column names)
print(cv.get_feature_names_out(), "\n")

# Count matrix: rows = documents, columns = word counts
print(bow4.toarray(), "\n")

['hello' 'one' 'play' 'test' 'world'] 

[[3 0 1 0 2]
 [1 1 0 4 0]] 



In [11]:
# Sparse format: only stores non-zero values (saves memory)
print(bow4, type(bow4), "\n")

<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 6 stored elements and shape (2, 5)>
  Coords	Values
  (0, 0)	3
  (0, 4)	2
  (0, 2)	1
  (1, 0)	1
  (1, 3)	4
  (1, 1)	1 <class 'scipy.sparse._csr.csr_matrix'> 
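To see what the count matrix contains without any library machinery, the same toy example can be reproduced with plain Python. This is only a sketch of the idea (whitespace tokenization, no stop-word removal), not how `CountVectorizer` is implemented internally.

```python
from collections import Counter

docs = [
    "hello world hello hello world play",
    "test test test test one hello",
]

# Count tokens per document (whitespace tokenization only)
counts = [Counter(doc.split()) for doc in docs]

# Vocabulary = sorted set of all tokens; each token becomes one column
vocab = sorted(set().union(*counts))

# One row per document, one column per vocabulary entry (0 if the token is absent)
matrix = [[c[token] for token in vocab] for c in counts]

print(vocab)
print(matrix)
```

The vocabulary and rows match the `CountVectorizer` output above: `hello` appears three times in the first message and once in the second, and so on.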



Now let's compute and store the token count matrix for the real data:

## Create bag-of-words matrix for all emails

In this step, you will convert the email **text content** into a **Bag-of-Words (BoW)** representation using `CountVectorizer`.

✅ **Important note:**  
In the in-class demo, we used `CountVectorizer(analyzer=process_text)`, where `process_text` performs custom text processing.  
That approach can be **slow** and may produce **many printed outputs** because the custom analyzer shows intermediate processing steps.

For this lab, we will use a simpler and faster approach by letting `CountVectorizer` handle the tokenization internally, and we will enable English stop-word removal using:

- `stop_words="english"`

➡️ Your task: apply `CountVectorizer(stop_words="english")` on the `text` column and store the result in `messages_bow` as a **sparse matrix**.


In [12]:
cv = CountVectorizer(stop_words="english")
messages_bow = cv.fit_transform(df["text"])
print(messages_bow)

<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 505786 stored elements and shape (5695, 36996)>
  Coords	Values
  (0, 32145)	1
  (0, 23219)	1
  (0, 18705)	1
  (0, 9986)	1
  (0, 17562)	1
  (0, 21006)	1
  (0, 27817)	1
  (0, 16546)	1
  (0, 27941)	1
  (0, 9223)	3
  (0, 21520)	2
  (0, 32408)	1
  (0, 18103)	1
  (0, 18751)	1
  (0, 15964)	2
  (0, 7986)	1
  (0, 20818)	3
  (0, 32126)	1
  (0, 31776)	1
  (0, 24679)	1
  (0, 35805)	2
  (0, 21296)	2
  (0, 32839)	1
  (0, 12539)	1
  (0, 26937)	2
  :	:
  (5694, 24659)	2
  (5694, 21490)	1
  (5694, 5683)	9
  (5694, 30755)	1
  (5694, 2807)	3
  (5694, 13246)	1
  (5694, 13036)	1
  (5694, 17257)	1
  (5694, 14028)	1
  (5694, 20137)	1
  (5694, 31635)	1
  (5694, 13037)	1
  (5694, 20329)	1
  (5694, 35066)	1
  (5694, 8557)	1
  (5694, 29914)	1
  (5694, 13428)	5
  (5694, 35964)	1
  (5694, 943)	2
  (5694, 2776)	1
  (5694, 30109)	1
  (5694, 17456)	1
  (5694, 33710)	1
  (5694, 10293)	1
  (5694, 11304)	1


## Training

Now that we have loaded and pre-processed the email dataset, it's time to **train** a classifier for the task. First, we will split the email dataset into an 80% **training set** and a 20% **test set**. Each set contains the sample features as well as the corresponding labels.

In [13]:
from sklearn.model_selection import train_test_split

# Split the data into 80% training (X_train & y_train)
# and 20% testing (X_test & y_test) data sets
X_train, X_test, y_train, y_test = train_test_split(messages_bow, df['spam'], test_size = 0.20, random_state = 0)
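The idea behind the split can be sketched in plain Python: shuffle the row indices with a fixed seed, then cut the list in two. This sketch will not reproduce the exact rows chosen by `train_test_split`. Rounding the test size up is an assumption here, chosen because it reproduces the 4556 / 1139 sample counts that appear as `support` in the evaluation reports.

```python
import math
import random

n_samples = 5695     # rows in messages_bow after dropping duplicates
test_percent = 20    # 20% held out for testing

# Assumption: round the test size up
n_test = math.ceil(n_samples * test_percent / 100)
n_train = n_samples - n_test

# Shuffle row indices reproducibly, then cut the list in two
rng = random.Random(0)
indices = list(range(n_samples))
rng.shuffle(indices)
test_idx, train_idx = indices[:n_test], indices[n_test:]

print(n_train, n_test)
```

Fixing the seed (like `random_state = 0` in the cell above) makes the split repeatable, so results can be compared across runs.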

Then, we train a **logistic regression** classifier on the training set. The class of a sample is determined from its predicted probability, which is computed as follows:

$$
\begin{aligned}
P(Y = 1 \mid X = \mathbf{x}) &= \frac{e^{\mathbf{x}^T \mathbf{b}}}{1+e^{\mathbf{x}^T \mathbf{b}}} \\
P(Y = 0 \mid X = \mathbf{x}) &= 1 - P(Y = 1 \mid X = \mathbf{x})
\end{aligned}
$$

where $\mathbf{x}$ is the feature vector of a sample and $\mathbf{b}$ is a trainable parameter vector. During training, we **minimize** the **cross-entropy loss** with respect to $\mathbf{b}$ using a gradient-based optimizer (scikit-learn's `LogisticRegression` uses the L-BFGS solver by default and also adds an L2 penalty on the parameters):

$$
l_{CE} = -(y \log P(Y = 1 \mid X = \mathbf{x}) + (1 - y) \log P(Y = 0 \mid X = \mathbf{x}))
$$
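The probability and loss formulas can be tried out on a single toy sample. The sketch below uses plain gradient descent with a made-up step size purely for illustration. It is not what `LogisticRegression` does internally (no regularization, no L-BFGS), and the values of `x`, `y` and `learning_rate` are hypothetical.

```python
import math

def sigmoid(z):
    """P(Y = 1 | x) for a linear score z = x^T b; equals e^z / (1 + e^z)."""
    return 1.0 / (1.0 + math.exp(-z))

def cross_entropy(y, p):
    """Cross-entropy loss for one sample with label y and predicted P(Y = 1) = p."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# Toy sample: three token counts and a spam label (made-up values)
x = [3.0, 0.0, 1.0]
y = 1
b = [0.0, 0.0, 0.0]    # start from an all-zero parameter vector
learning_rate = 0.1    # hypothetical step size

losses = []
for step in range(5):
    z = sum(xi * bi for xi, bi in zip(x, b))
    p = sigmoid(z)
    losses.append(cross_entropy(y, p))
    # Gradient of the loss with respect to b is (p - y) * x
    b = [bi - learning_rate * (p - y) * xi for xi, bi in zip(x, b)]

print([round(l, 4) for l in losses])
```

Each update moves `b` so that the predicted probability of the true label increases, which is why the recorded loss shrinks from step to step.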

In [14]:
from sklearn.linear_model import LogisticRegression

classifier = LogisticRegression(random_state=0)
classifier.fit(X=X_train, y=y_train)

LogisticRegression(random_state=0)


## Evaluation

Finally, we need to determine how good our classification model is. This is known as **evaluation**. We will use our trained model to make predictions for both training and testing data, and calculate various metrics with the predictions and actual labels.

In [15]:
# Print predictions on training data
# The `predict` method computes model predictions from input data
print("Training prediction:\n", classifier.predict(X_train), "\n")

# Print the actual labels
print("Training actual:\n", y_train.values, "\n")

Training prediction:
 [0 0 0 ... 0 0 0] 

Training actual:
 [0 0 0 ... 0 0 0] 



There are a number of useful metrics for evaluating binary classifiers, available through the `classification_report`, `confusion_matrix` and `accuracy_score` functions:

* **Confusion Matrix**: a matrix that indicates how many samples are correctly or incorrectly classified. The cell in the $i$-th row and $j$-th column counts the samples that belong to the $i$-th class and are predicted as the $j$-th class. For binary classification, the confusion matrix has only two rows and two columns. Note that scikit-learn's `confusion_matrix` orders classes by label value, so with labels 0 and 1 the printed matrix is laid out as `[[TN, FP], [FN, TP]]`.

|Actual class|Predicted positive |Predicted negative |
|------------|-------------------|-------------------|
|Positive    |True Positive (TP) |False Negative (FN)|
|Negative    |False Positive (FP)|True Negative (TN) |

* **Accuracy**: proportion of samples that are correctly classified.

$$
Accuracy = \frac{TP+TN}{TP+FP+TN+FN}
$$

* **Precision**: of all positive predictions, how many of them are actually correct?

$$
Precision = \frac{TP}{TP+FP}
$$

* **Recall**: of all actually positive samples, how many of them are predicted correctly?

$$
Recall = \frac{TP}{TP+FN}
$$

* **F1 Score**: the harmonic mean of precision and recall.

$$
F1 = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}
$$
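The four formulas above can be sketched directly from two lists of labels. The function name `binary_metrics` and the toy labels are made up for illustration. The sketch assumes label 1 is the positive (spam) class and does not guard against division by zero when a class is never predicted.

```python
def binary_metrics(y_true, y_pred):
    """Count confusion-matrix cells, then apply the metric formulas."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"tp": tp, "tn": tn, "fp": fp, "fn": fn,
            "accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Toy labels: 1 = spam, 0 = ham
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

In this toy case one spam email is missed (FN) and one ham email is flagged (FP), so accuracy, precision, recall and F1 all come out to 0.75.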

We first calculate and print various metrics for the training data:

In [16]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Predict and evaluate on training data
pred = classifier.predict(X_train)

# `classification_report` outputs classification metrics
# such as precision, recall and F1 score
print(classification_report(y_train, pred))

# `confusion_matrix` outputs how many samples are correctly or incorrectly classified
print('Confusion Matrix: \n', confusion_matrix(y_train, pred), "\n")

# `accuracy_score` computes classification accuracy
print('Accuracy: ', accuracy_score(y_train, pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00      3457
           1       1.00      1.00      1.00      1099

    accuracy                           1.00      4556
   macro avg       1.00      1.00      1.00      4556
weighted avg       1.00      1.00      1.00      4556

Confusion Matrix: 
 [[3457    0]
 [   0 1099]] 

Accuracy:  1.0


We now calculate and print the same metrics for the testing data. This measures the ability of the classification model to generalize to similar yet unseen data. The smaller the gap between training and testing performance, the better the model generalizes.

In [17]:
# Print predictions on testing data
# The `predict` method computes model predictions from input data
print("Testing prediction:\n", classifier.predict(X_test), "\n")

# Print the actual labels
print("Testing actual:\n", y_test.values, "\n")

Testing prediction:
 [1 0 0 ... 0 0 0] 

Testing actual:
 [1 0 0 ... 0 0 0] 



In [18]:
# Predict and evaluate on testing data
pred = classifier.predict(X_test)

# `classification_report` outputs classification metrics
# such as precision, recall and F1 score
print(classification_report(y_test, pred))

# `confusion_matrix` outputs how many samples are correctly or incorrectly classified
print('Confusion Matrix: \n', confusion_matrix(y_test, pred), "\n")

# `accuracy_score` computes classification accuracy
print('Accuracy: ', accuracy_score(y_test, pred))

              precision    recall  f1-score   support

           0       0.99      0.99      0.99       870
           1       0.98      0.98      0.98       269

    accuracy                           0.99      1139
   macro avg       0.99      0.99      0.99      1139
weighted avg       0.99      0.99      0.99      1139

Confusion Matrix: 
 [[865   5]
 [  5 264]] 

Accuracy:  0.9912203687445127
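As a sanity check, the test-set numbers above can be recomputed by hand from the printed confusion matrix. The sketch relies on scikit-learn's layout for labels 0 and 1, `[[TN, FP], [FN, TP]]`, and treats spam (label 1) as the positive class.

```python
# Test-set confusion matrix as printed above: [[TN, FP], [FN, TP]]
cm = [[865, 5],
      [5, 264]]
(tn, fp), (fn, tp) = cm

accuracy = (tp + tn) / (tp + fp + tn + fn)   # 1129 correct out of 1139
precision = tp / (tp + fp)                   # spam predictions that are right
recall = tp / (tp + fn)                      # spam emails that are caught

print(accuracy, round(precision, 2), round(recall, 2))
```

Only 10 of the 1139 test emails are misclassified (5 ham flagged as spam, 5 spam missed), which agrees with the 0.98 precision and recall reported for class 1.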


## Discussion Question: Why Bag-of-Words (BoW) Still Works (and its Limitations)

In this lab, we used **Bag-of-Words (BoW)** to convert email text into numerical features that a machine learning model can understand.

### A common concern with BoW
In the in-class discussion, we learned that some words appear in **many** documents (examples: *“the”*, *“and”*, *“hello”*, *“thanks”*). These very frequent words can cause two issues:

1. **They do not help distinguish spam vs. ham**  
   If a word appears in almost every email, it does not provide useful information for classification.

2. **Different emails can look similar in feature space**  
   Two different messages may share many common words, which can lead to **similar BoW representations**, even if their meaning is different.
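One way to explore the first concern is to count, per class, how many emails contain each word. The sketch below uses a made-up four-email corpus; the texts and variable names are purely illustrative and are not taken from `emails.csv`.

```python
from collections import Counter

# Made-up toy corpus: (text, label) with 1 = spam, 0 = ham
corpus = [
    ("hello win free money now", 1),
    ("hello free stock offer", 1),
    ("hello meeting agenda attached", 0),
    ("hello thanks for the agenda", 0),
]

# Document frequency: in how many emails of each class does a token appear?
doc_freq_spam, doc_freq_ham = Counter(), Counter()
for text, label in corpus:
    target = doc_freq_spam if label == 1 else doc_freq_ham
    target.update(set(text.split()))   # set() counts each token once per email

for token in ["hello", "free", "agenda"]:
    print(token, doc_freq_spam[token], doc_freq_ham[token])
```

Here `hello` occurs equally often in both classes and carries no signal, while `free` and `agenda` each appear in only one class.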

---

### ✅ Your Task (Short Answer)
Even with the limitations above, BoW often performs surprisingly well for spam detection.

**Why does the Bag-of-Words method still work well in this lab?**  
Write a **short explanation** (2–4 sentences) and include **at least one clear reason** supported by what you observe in the dataset or model behavior.


In [None]:
### Please include your Answer here
# Bag-of-Words still works because the spam and ham emails in this dataset use vastly different words.
# Bag-of-Words tends to fail when the text is very different from normal emails, such as a different
# language, or the inclusion of many uncommon words, abbreviations, names, or slang.
# Comparing the misclassified emails with the correctly classified ones shows that the
# misclassified emails look much more "different" from the rest.
pred = classifier.predict(X_test)

# Print every misclassified test email, plus the first 10 correctly classified ones
shown_correct = 0
# y_test keeps the original DataFrame index, so `idx` looks up the matching email text
for prediction, (idx, label) in zip(pred, y_test.items()):
    if prediction != label:
        print("False positive" if prediction else "False negative")
        print(df.loc[idx, "text"], end='\n\n')
    elif shown_correct < 10:
        print("True positive" if prediction else "True negative")
        print(prediction, label)
        print(df.loc[idx, "text"], end='\n\n')
        shown_correct += 1


True positive
1 1
Subject: please read : newsletter regarding smallcaps  small - cap stock finder  new developments expected to move western sierra mining , inc . stock  from 0 . 70 to over 4 . 00  westernsierramining . com  western sierra mining is a company on the move , fast ! big news is out !  big business is afoot for wsrm !  read on to find out why wsrm is our top pick this week .  * western sierra mining has a very profitable business model in which  they avoid the highest cost associate with mining : exploration .  essentially , wester sierra operates mines on sites that have been previously  explored and found to be " too small " for the largest mining companies ,  yet still produce handsome profits .  * the global mining industry boom will continue for the foreseeable  future due to the impact of china - driven demand on commodity prices and  long supply - response lead times .  * news ! news ! news ! read on to find out why we expect wsrm to take off  this week !  here is r

## References
1. https://github.com/randerson112358/Python/blob/master/Email_Spam_Detection/Email_Spam_Detection.ipynb
2. https://stackoverflow.com/questions/27488446/how-do-i-get-word-frequency-in-a-corpus-using-scikit-learn-countvectorizer