# Spam Detection with Logistic Regression
**Author:** Marco Antonio García Sánchez  
**Objective:** Build a machine learning system to classify emails as SPAM or HAM using Logistic Regression.  
**Dataset:** [TREC 2007 Public Spam Corpus](http://plg.uwaterloo.ca/~gvcormac/trec07/)

---

**Original source / credit:**  
This notebook is based on the Udemy course: **"Machine Learning y Data Science: Curso Completo con Python"**  
URL: [udemy.com/course/machine-learning-desde-cero/learn/lecture/19203700](https://www.udemy.com/course/machine-learning-desde-cero/learn/lecture/19203700)

**Modifications and improvements:**  
- Reorganized sections for clarity and reproducibility  
- Added detailed explanations in Markdown for better understanding  
- Enhanced visualizations and metrics reporting  
- Cleaned and standardized code for repository presentation  

---

This notebook is part of my Machine Learning portfolio on GitHub.

#### Dataset Selection

The objective of this project is to build a machine learning system capable of predicting whether a given email is SPAM or HAM.  
For this purpose, the following dataset has been selected:

**2007 TREC Public Spam Corpus**

The `trec07p` corpus contains **75,419 messages**:  

- 25,220 HAM (legitimate emails)  
- 50,199 SPAM  

These messages represent all emails delivered to a particular server between the following dates:  

- Sun, 8 Apr 2007 13:07:21 -0400  
- Fri, 6 Jul 2007 07:04:53 -0400  

This dataset will serve as the foundation for training and evaluating our Logistic Regression classifier.
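As a quick sanity check on these counts, the snippet below computes the class proportions and the accuracy that a trivial classifier would reach by always predicting the majority class. The variable names are illustrative; only the two counts come from the corpus description above.

```python
# Counts taken from the corpus description above
n_ham = 25_220
n_spam = 50_199
n_total = n_ham + n_spam

spam_ratio = n_spam / n_total
ham_ratio = n_ham / n_total

# A classifier that always answers "spam" is right for every spam email,
# so its accuracy equals the share of the larger class
majority_baseline = max(spam_ratio, ham_ratio)

print(f"Total messages: {n_total}")
print(f"Spam: {spam_ratio:.1%}  Ham: {ham_ratio:.1%}")
print(f"Majority-class baseline accuracy: {majority_baseline:.3f}")
```

Since about two thirds of the corpus is spam, accuracy figures are best read against this baseline rather than against 0.5.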

## 1.- Environment & Libraries

#### Python Environment Setup
To ensure reproducibility and avoid conflicts between Python packages, it is recommended to use a **virtual environment** for each project.  
This way, you can create, manage, and delete environments without affecting your main Python installation.

#### Steps to create a virtual environment (optional but recommended):
```bash
# Create a virtual environment named 'ml_env'
python -m venv ml_env

# Activate the environment (Windows)
ml_env\Scripts\activate

# Activate the environment (macOS/Linux)
source ml_env/bin/activate

# After finishing the project, you can deactivate it
deactivate

# To remove the environment completely, delete its folder (macOS/Linux)
rm -rf ml_env
```

## 2.- Data Cleaning

#### Cleaning HTML data

In this practical case related to SPAM email detection, the dataset consists of emails with their corresponding headers and additional fields. Therefore, the data requires **preprocessing** before being fed into the Machine Learning algorithm.

In [11]:
# ------------------------------------------------------------
# This class facilitates preprocessing of emails containing HTML code.
# It extracts only the text content, removing all HTML tags.
# ------------------------------------------------------------

from html.parser import HTMLParser

class MLStripper(HTMLParser):
    """Helper class for cleaning HTML content in emails. Inherits from Python's built-in HTMLParser."""
    def __init__(self):
        super().__init__()
        self.strict = False          # Legacy flag; HTMLParser ignores it since Python 3.5
        self.convert_charrefs = True # Convert HTML character references automatically
        self.fed = []                # List to store text segments

    def handle_data(self, d):
        """Called for each chunk of text found inside the HTML; appends it to the fed list."""
        self.fed.append(d)

    def get_data(self):
        "Joins all text segments and returns the cleaned text."
        return ''.join(self.fed)

In [13]:
# ------------------------------------------------------------
# This function removes HTML tags from the text of an email.
# It uses the MLStripper class defined above to extract only the text content.
# ------------------------------------------------------------
def strip_tags(html):
    """
    Remove HTML tags from a given HTML string.
    Parameters:
    -----------
    html : str
        The HTML content to be cleaned.
    Returns:
    --------
    str
        Cleaned text with all HTML tags removed.
    Usage:
    ------
    clean_text = strip_tags(email_html_content)
    """
    s = MLStripper()
    s.feed(html)
    return s.get_data()

In [15]:
# Example of removing HTML tags from a more realistic email content
email_html = """
<html>
  <head><title>Special Offer!</title></head>
  <body>
    <h1>Congratulations!</h1>
    <p>Dear user,</p>
    <p>You have been selected to receive a <b>50% discount</b> on our premium plan.</p>
    <p>Click <a href="https://www.example.com/redeem">here</a> to claim your offer.</p>
    <footer>
      <p>Contact us at support@example.com</p>
      <p>&copy; 2025 Example Corp.</p>
    </footer>
  </body>
</html>
"""

# Clean the HTML using the strip_tags function
clean_text = strip_tags(email_html)
print(clean_text)



  Special Offer!
  
    Congratulations!
    Dear user,
    You have been selected to receive a 50% discount on our premium plan.
    Click here to claim your offer.
    
      Contact us at support@example.com
      © 2025 Example Corp.
    
  




#### Cleaning and reducing noise

In addition to removing possible HTML tags from emails, several other preprocessing steps are necessary to reduce noise in the messages. These steps include:

- **Removing punctuation** to clean the text from unnecessary symbols.  
- **Removing stopwords**: dropping very common English words that carry little meaning.  
- **Discarding irrelevant email fields** that do not contribute useful information.  
- **Stemming**: reducing words to their root form by removing affixes.  

The following class implements these transformations, preparing the email content for Machine Learning algorithms.

In [19]:
import email
import string
import nltk
nltk.download('stopwords')

class Parser:
    """
    Preprocess emails: remove HTML, punctuation, stopwords, and apply stemming.
    """

    def __init__(self):
        self.stemmer = nltk.PorterStemmer()
        self.stopwords = set(nltk.corpus.stopwords.words('english'))
        self.punctuation = list(string.punctuation)

    def parse(self, email_path):
        """Parse email file and return content as dict."""
        with open(email_path, errors='ignore') as e:
            msg = email.message_from_file(e)
        return None if not msg else self.get_email_content(msg)

    def get_email_content(self, msg):
        """Extract subject, body, and content type from email."""
        subject = self.tokenize(msg['Subject']) if msg['Subject'] else []
        body = self.get_email_body(msg.get_payload(), msg.get_content_type())
        content_type = msg.get_content_type()
        return {"subject": subject, "body": body, "content_type": content_type}

    def get_email_body(self, payload, content_type):
        """Extract and preprocess email body."""
        body = []
        if type(payload) is str and content_type == 'text/plain':
            return self.tokenize(payload)
        elif type(payload) is str and content_type == 'text/html':
            return self.tokenize(strip_tags(payload))
        elif type(payload) is list:
            for p in payload:
                body += self.get_email_body(p.get_payload(), p.get_content_type())
        return body

    def tokenize(self, text):
        """Clean text: remove punctuation, stopwords, and apply stemming."""
        for c in self.punctuation:
            text = text.replace(c, "")
        text = text.replace("\t", " ").replace("\n", " ")
        tokens = list(filter(None, text.split(" ")))
        return [self.stemmer.stem(w) for w in tokens if w not in self.stopwords]

[nltk_data] Downloading package stopwords to /Users/marco/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
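
The effect of `tokenize` can be seen in isolation with a small standard-library sketch. It mirrors the punctuation removal and the case-sensitive stopword filter of `Parser.tokenize`, but it is only an approximation: `TINY_STOPWORDS` is a hypothetical stand-in for NLTK's English list, `simple_tokenize` is an illustrative name, and stemming is replaced by plain lowercasing.

```python
import string

# Hypothetical, tiny stand-in for nltk.corpus.stopwords.words('english')
TINY_STOPWORDS = {"you", "the", "to", "do", "a", "is"}

def simple_tokenize(text):
    """Strip punctuation, split on spaces, drop stopwords, lowercase the rest."""
    for c in string.punctuation:
        text = text.replace(c, "")
    text = text.replace("\t", " ").replace("\n", " ")
    tokens = [t for t in text.split(" ") if t]
    # The stopword check runs on the original casing, as in Parser.tokenize
    return [t.lower() for t in tokens if t not in TINY_STOPWORDS]

print(simple_tokenize("Do you feel the pressure to perform?"))
```

This prints `['do', 'feel', 'pressure', 'perform']`. Because the filter compares the original casing, a capitalized "Do" is not treated as a stopword, which likely explains why `'do'` survives as a body token in the parsed email shown later in this notebook.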


#### Reading and Preprocessing a Single Email

In the following cells, we:
1. Read the raw content of an individual email from the dataset (`inmail.1`).
2. Process it with the `Parser()` class to extract only the relevant words.

⚠️ Printing the entire raw email is optional, but it is a useful **visual tool** to see what the functions are doing and how the email content looks before cleaning.

👉 The goal of the parser is to keep only the **meaningful words** in the message (e.g., "free", "money", "offer") and remove uninformative words such as articles (*the, a, an*) and other English stopwords.  
This reduces noise and makes the model focus on the truly discriminative features when classifying emails as spam or ham.

In [37]:
# Read the content of a specific email
# (Optional: this step helps visualize the raw structure of an email before cleaning)
inmail = open("/Users/marco/Desktop/GitRepositorios/Datasets/trec07p/data/inmail.1").read()
print(inmail)

From RickyAmes@aol.com  Sun Apr  8 13:07:32 2007
Return-Path: <RickyAmes@aol.com>
Received: from 129.97.78.23 ([211.202.101.74])
	by speedy.uwaterloo.ca (8.12.8/8.12.5) with SMTP id l38H7G0I003017;
	Sun, 8 Apr 2007 13:07:21 -0400
Received: from 0.144.152.6 by 211.202.101.74; Sun, 08 Apr 2007 19:04:48 +0100
Message-ID: <WYADCKPDFWWTWTXNFVUE@yahoo.com>
From: "Tomas Jacobs" <RickyAmes@aol.com>
Reply-To: "Tomas Jacobs" <RickyAmes@aol.com>
To: the00@speedy.uwaterloo.ca
Subject: Generic Cialis, branded quality@ 
Date: Sun, 08 Apr 2007 21:00:48 +0300
X-Mailer: Microsoft Outlook Express 6.00.2600.0000
MIME-Version: 1.0
Content-Type: multipart/alternative;
	boundary="--8896484051606557286"
X-Priority: 3
X-MSMail-Priority: Normal
Status: RO
Content-Length: 988
Lines: 24

----8896484051606557286
Content-Type: text/html;
Content-Transfer-Encoding: 7Bit

<html>
<body bgcolor="#ffffff">
<div style="border-color: #00FFFF; border-right-width: 0px; border-bottom-width: 0px; margin-bottom: 0px;" align="

In [39]:
# Process the email with Parser to extract only the cleaned and relevant words
# Stopwords and unimportant tokens are removed so the model can focus on meaningful content
p = Parser()
p.parse("/Users/marco/Desktop/GitRepositorios/Datasets/trec07p/data/inmail.1")

{'subject': ['gener', 'ciali', 'brand', 'qualiti'],
 'body': ['do',
  'feel',
  'pressur',
  'perform',
  'rise',
  'occas',
  'tri',
  'viagra',
  'anxieti',
  'thing',
  'past',
  'back',
  'old',
  'self'],
 'content_type': 'multipart/alternative'}
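
The `multipart/alternative` content type is the reason `get_email_body` recurses: for a multipart message, `get_payload()` returns a list of sub-messages rather than a string. The following sketch uses only the standard-library `email` module on a made-up two-part message (the boundary `XYZ` and the body texts are invented for illustration).

```python
import email

# A minimal, hand-written multipart message (illustrative content only)
raw = (
    "Subject: Hello\n"
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/alternative; boundary="XYZ"\n'
    "\n"
    "--XYZ\n"
    "Content-Type: text/plain\n"
    "\n"
    "plain body\n"
    "--XYZ\n"
    "Content-Type: text/html\n"
    "\n"
    "<p>html body</p>\n"
    "--XYZ--\n"
)

msg = email.message_from_string(raw)
payload = msg.get_payload()

# For a multipart message the payload is a list of Message objects
part_types = [part.get_content_type() for part in payload]
part_texts = [part.get_payload().strip() for part in payload]

print(msg.get_content_type())
print(part_types)
print(part_texts)
```

Each part carries its own content type, so a plain-text part can be tokenized directly while an HTML part goes through `strip_tags` first.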

#### Processing the `index` file  

In this section:  
1. We read the `index` file, which contains the list of all emails in the dataset along with their label (spam or ham).  
2. We define the function `parse_index()` that extracts:  
   - The **label** of the email (spam or ham).  
   - The **file path** of the corresponding email.  
   This is returned as a list of dictionaries for easier handling.  
3. We define the function `parse_email()` that takes one index entry, opens the email, and processes it with the `Parser` to obtain only the relevant words along with its label.  

👉 This step is essential because it gives us a clean representation of each email: its filtered content and its category (spam or ham).  

In [88]:
index = open("/Users/marco/Desktop/GitRepositorios/Datasets/trec07p/full/index").readlines()
index[0:3]

['spam ../data/inmail.1\n',
 'ham ../data/inmail.2\n',
 'spam ../data/inmail.3\n']

In [102]:
import os

DATASET_PATH = os.path.join("/Users/marco/Desktop/GitRepositorios/datasets", "trec07p")

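def parse_index(path_to_index, n_elements):
    """Return the first n_elements index entries as dicts with 'label' and 'email_path'."""
    ret_indexes = []
    with open(path_to_index) as f:
        index = f.readlines()
    for i in range(n_elements):
        # Each line looks like 'spam ../data/inmail.1\n'
        mail = index[i].split(" ../")
        label = mail[0]
        path = mail[1][:-1]              # drop the trailing newline
        path_mail = path.split("/")[-1]  # keep only the file name, e.g. 'inmail.1'
        ret_indexes.append({"label": label, "email_path": os.path.join(DATASET_PATH, "data", path_mail)})
    return ret_indexes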

In [104]:
def parse_email(index):
    p = Parser()
    pmail = p.parse(index["email_path"])
    return pmail, index["label"]

In [106]:
indexes = parse_index("/Users/marco/Desktop/GitRepositorios/datasets/trec07p/full/index", 3)
indexes

[{'label': 'spam',
  'email_path': '/Users/marco/Desktop/GitRepositorios/datasets/trec07p/data/inmail.1'},
 {'label': 'ham',
  'email_path': '/Users/marco/Desktop/GitRepositorios/datasets/trec07p/data/inmail.2'},
 {'label': 'spam',
  'email_path': '/Users/marco/Desktop/GitRepositorios/datasets/trec07p/data/inmail.3'}]
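
The string handling inside `parse_index` can be checked on a single line without touching the disk. The sketch below is a hedged variant, not the notebook's function: `parse_index_line` and the `/tmp/trec07p` root are hypothetical, and it uses `split(maxsplit=1)` and `posixpath.basename` instead of slicing.

```python
import posixpath

def parse_index_line(line, dataset_root):
    """Turn one index line such as 'spam ../data/inmail.1' into a dict."""
    # First field is the label, the rest is the relative path
    label, rel_path = line.strip().split(maxsplit=1)
    file_name = posixpath.basename(rel_path)  # e.g. 'inmail.1'
    return {
        "label": label,
        "email_path": posixpath.join(dataset_root, "data", file_name),
    }

entry = parse_index_line("spam ../data/inmail.1\n", "/tmp/trec07p")
print(entry)
```

Stripping the whole line first avoids depending on the exact line ending, which can differ if the index file was saved with Windows line breaks.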


## 3.- Preprocessing Demonstration

This section demonstrates how an email is transformed from raw text (strings)
into a numerical representation that Machine Learning algorithms can use.
The steps are:

1. Load the index and read one email from the dataset (spam or ham).
2. Process the email with `parse_email` to separate subject and body.
3. Convert the text into a numerical representation using Bag of Words with `CountVectorizer` (each word is treated as a feature).
4. Display the extracted features and their numerical values.

This shows how raw text is converted into structured data
for use in spam classification models.

In [156]:
# Load the index and labels into memory (only one email for demonstration)
index = parse_index("/Users/marco/Desktop/GitRepositorios/datasets/trec07p/full/index", 1)
index

[{'label': 'spam',
  'email_path': '/Users/marco/Desktop/GitRepositorios/datasets/trec07p/data/inmail.1'}]

In [158]:
# Read the raw content of the first email
import os
open(index[0]["email_path"]).read()

'From RickyAmes@aol.com  Sun Apr  8 13:07:32 2007\nReturn-Path: <RickyAmes@aol.com>\nReceived: from 129.97.78.23 ([211.202.101.74])\n\tby speedy.uwaterloo.ca (8.12.8/8.12.5) with SMTP id l38H7G0I003017;\n\tSun, 8 Apr 2007 13:07:21 -0400\nReceived: from 0.144.152.6 by 211.202.101.74; Sun, 08 Apr 2007 19:04:48 +0100\nMessage-ID: <WYADCKPDFWWTWTXNFVUE@yahoo.com>\nFrom: "Tomas Jacobs" <RickyAmes@aol.com>\nReply-To: "Tomas Jacobs" <RickyAmes@aol.com>\nTo: the00@speedy.uwaterloo.ca\nSubject: Generic Cialis, branded quality@ \nDate: Sun, 08 Apr 2007 21:00:48 +0300\nX-Mailer: Microsoft Outlook Express 6.00.2600.0000\nMIME-Version: 1.0\nContent-Type: multipart/alternative;\n\tboundary="--8896484051606557286"\nX-Priority: 3\nX-MSMail-Priority: Normal\nStatus: RO\nContent-Length: 988\nLines: 24\n\n----8896484051606557286\nContent-Type: text/html;\nContent-Transfer-Encoding: 7Bit\n\n<html>\n<body bgcolor="#ffffff">\n<div style="border-color: #00FFFF; border-right-width: 0px; border-bottom-width: 0

In [160]:
# Parse the email into structured format (subject, body, and label)
mail, label = parse_email(index[0])
print("The email label is:", label)
print(mail)

The email label is: spam
{'subject': ['gener', 'ciali', 'brand', 'qualiti'], 'body': ['do', 'feel', 'pressur', 'perform', 'rise', 'occas', 'tri', 'viagra', 'anxieti', 'thing', 'past', 'back', 'old', 'self'], 'content_type': 'multipart/alternative'}


### CountVectorizer – Transforming Text into Numbers  

The **Logistic Regression algorithm cannot process raw text**. It needs numbers.  
To solve this, we use **`CountVectorizer`**, which converts emails into a numerical format.  

#### What CountVectorizer does  
1. **Tokenization** → Splits the email into individual words (*tokens*).  
   Example:  
   `"This email is spam"` → `[ "this", "email", "is", "spam" ]`  

2. **Builds a vocabulary** → Creates a dictionary of all unique words found in the text, indexed in alphabetical order.  
   Example:  
   `{ "email": 0, "is": 1, "spam": 2, "this": 3 }`  

3. **Counts word occurrences** → For each email, counts how many times each word appears.  
   Example:  
   `"This email is spam spam"` → `[1, 1, 2, 1]`  
   *(“email” appears once, “is” once, “spam” twice, “this” once)*  

This is called a **Bag-of-Words representation**, because it ignores grammar and word order, but keeps the frequency of words.  

#### Why it matters  
- Logistic Regression works only with **numerical features**, so Bag-of-Words makes text usable.  
- Common, uninformative words (*“the”*, *“a”*, *“is”*, etc.) can be removed, since they don’t help classify spam vs ham (**stop-word removal**).  
- The result is a **document-term matrix**, which becomes the input for the machine learning model.  
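
The three steps above can be sketched from scratch with the standard library. This is not scikit-learn's implementation, only an illustration of the idea: the two example documents are invented, `to_counts` is a hypothetical helper, and the real `CountVectorizer` additionally lowercases text and, by default, drops single-character tokens.

```python
from collections import Counter

# Two invented documents, already lowercased and free of punctuation
docs = ["this email is spam spam", "this email is fine"]

# Vocabulary: every unique token, mapped to a column index in sorted order
unique_tokens = sorted({tok for doc in docs for tok in doc.split()})
vocabulary = {tok: i for i, tok in enumerate(unique_tokens)}

def to_counts(doc):
    """Return the count vector of one document over the shared vocabulary."""
    counts = Counter(doc.split())
    return [counts.get(tok, 0) for tok in vocabulary]

# Document-term matrix: one row per document, one column per word
matrix = [to_counts(doc) for doc in docs]

print(vocabulary)
print(matrix)
```

Each row has the same length regardless of how long the email is, which is what lets a linear model such as Logistic Regression consume it.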

In [163]:
from sklearn.feature_extraction.text import CountVectorizer

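# Prepare the email as a single string (subject + body combined).
# Note: the two parts are concatenated without a separator, so the last subject
# token fuses with the first body token ('qualiti' + 'do' -> 'qualitido' in the
# output below). " ".join(mail['subject'] + mail['body']) keeps them separate.
prep_email = [" ".join(mail['subject']) + " ".join(mail['body'])]

# Learn the vocabulary (Bag-of-Words features) from the email
vectorizer = CountVectorizer()
vectorizer.fit(prep_email)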

print("Email:", prep_email, "\n")
print("Input features (vocabulary):", vectorizer.get_feature_names_out())

Email: ['gener ciali brand qualitido feel pressur perform rise occas tri viagra anxieti thing past back old self'] 

Input features (vocabulary): ['anxieti' 'back' 'brand' 'ciali' 'feel' 'gener' 'occas' 'old' 'past'
 'perform' 'pressur' 'qualitido' 'rise' 'self' 'thing' 'tri' 'viagra']


In [165]:
# Convert the preprocessed email into numerical vector form
X = vectorizer.transform(prep_email)
print("\nNumerical values:\n", X.toarray())


Numerical values:
 [[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]]


### Function: `create_prep_dataset`

#### Description
Builds a dataset of emails ready for machine learning.  
Concatenates the subject and body of each email and associates it with its label (`spam` or `ham`).
#### Parameters
- `index_path` (str): Path to the dataset index file.  
- `n_elements` (int): Number of emails to process.  
#### Returns
- `X` (list of str): Full text of each email.  
- `y` (list of str): Corresponding labels (`spam` or `ham`).  
#### Notes
- Skips emails that cannot be parsed.  
- Useful for preparing raw emails before vectorization.

In [170]:
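def create_prep_dataset(index_path, n_elements):
    X = []
    y = []
    indexes = parse_index(index_path, n_elements)
    for i in range(n_elements):
        print("\rParsing email: {0}".format(i+1), end='')
        try:
            mail, label = parse_email(indexes[i])
            # Subject and body are joined without a separator between them,
            # matching the single-email demonstration above
            X.append(" ".join(mail['subject']) + " ".join(mail['body']))
            y.append(label)
        except Exception:
            # Skip emails that cannot be read or parsed
            pass
    return X, y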

## 4.- Training Dataset Preparation

In this step, we prepare a subset of emails to train the machine learning model.  
- We read **1000 emails** from the dataset (you can adjust the number as needed).  
- The emails are preprocessed and transformed into a numerical format using the vectorizer.  
- Finally, we display the numerical feature matrix and the number of features for inspection.  
- Optionally, we convert it into a **Pandas DataFrame** for easier visualization and analysis.  


In [258]:
X_train, y_train = create_prep_dataset("/Users/marco/Desktop/GitRepositorios/datasets/trec07p/full/index", 1000)
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(X_train)

Parsing email: 1000

In [260]:
import pandas as pd
pd.DataFrame(X_train.toarray(), columns=vectorizer.get_feature_names_out()).head(5)


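|   | 00 | 000 | 0000 | 000000 | 000002 | 000048000000000 | 000099 | 0000ff | 000115000000000 | 0001171749 | ... | 绰۹ϵͳctsƽ | 肾ǝvă | 鏗ėvłq | 饻jwk | 鵵χ | 낢ȏglgwă | 뵭袵 | 뼰ʱϵ | 쫷ƹư | 쵼ã |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |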


**DataFrame explanation:**  
- Each **row** represents a single email.  
- Each **column** represents a unique word found across all emails.  
- Each value is the **number of times** the word appears in the email (**0** if it does not appear).  

This shows how often each word occurs in every email.

In [247]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()
clf.fit(X_train, y_train)
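
As a rough picture of what the fitted model computes at prediction time in the binary case: it takes a weighted sum of the word counts plus an intercept and passes it through the sigmoid function to obtain a probability. The sketch below is a from-scratch illustration, not scikit-learn code; the three words, their weights, and the bias are hypothetical values chosen by hand.

```python
import math

def sigmoid(z):
    """Map any real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba_spam(counts, weights, bias):
    """Probability of 'spam' for one email given its word counts."""
    z = bias + sum(w * x for w, x in zip(weights, counts))
    return sigmoid(z)

# Hypothetical weights for the words ['free', 'meeting', 'offer']
weights = [1.2, -2.0, 0.8]
bias = -0.5

# An email containing 'free' twice and 'offer' once
p_spam = predict_proba_spam([2, 0, 1], weights, bias)
label = "spam" if p_spam >= 0.5 else "ham"
print(round(p_spam, 3), label)
```

Positive weights push an email toward spam and negative weights toward ham; training consists of finding the weights that best separate the labeled examples.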

## 5.- Prediction

##### Reading a new set of emails

In [194]:
# Read 150 emails from the dataset and keep only the last 50 (indices 100-149).
# Note: as written, the model above is trained on the first 1000 emails, which
# include these 50, so this is a sanity check rather than a test on unseen data.
X, y = create_prep_dataset("/Users/marco/Desktop/GitRepositorios/datasets/trec07p/full/index", 150)
X_test = X[100:]
y_test = y[100:]

Parsing email: 150
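
Whether a test slice is truly held out can be checked with index arithmetic alone. The sketch below assumes that list positions match index positions (that is, no email was skipped during parsing) and uses the sizes that appear in this notebook: 1000 training emails and a test slice `X[100:]` taken from a 150-email read. The variable names are illustrative.

```python
n_train = 1000                    # emails used to fit the vectorizer and model
train_indices = range(n_train)
test_indices = range(100, 150)    # positions covered by X[100:] of 150 emails

# Any shared position means the model has already seen that email
overlap = set(train_indices) & set(test_indices)
print(f"{len(overlap)} of {len(test_indices)} test emails were also in the training set")

# A disjoint alternative: read n_train + 50 emails and slice after the training range
held_out_indices = range(n_train, n_train + 50)
is_disjoint = not (set(train_indices) & set(held_out_indices))
print("Held-out slice is disjoint from training:", is_disjoint)
```

With these sizes, reading 1050 emails and taking `X[1000:]` would give 50 emails the model has not seen.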

##### Preprocessing the emails with the vectorizer created earlier

In [197]:
X_test = vectorizer.transform(X_test)

##### Predicting the email type

In [200]:
y_pred = clf.predict(X_test)
y_pred

array(['spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam',
       'spam', 'spam', 'ham', 'spam', 'spam', 'spam', 'spam', 'spam',
       'ham', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'ham',
       'spam', 'spam', 'ham', 'spam', 'spam', 'ham', 'spam', 'spam',
       'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam',
       'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam',
       'spam', 'spam'], dtype='<U4')

In [202]:
print("Prediction:\n", y_pred)
print("\nTrue labels:\n", y_test)

Prediction:
 ['spam' 'spam' 'spam' 'spam' 'spam' 'spam' 'spam' 'spam' 'spam' 'spam'
 'ham' 'spam' 'spam' 'spam' 'spam' 'spam' 'ham' 'spam' 'spam' 'spam'
 'spam' 'spam' 'spam' 'ham' 'spam' 'spam' 'ham' 'spam' 'spam' 'ham' 'spam'
 'spam' 'spam' 'spam' 'spam' 'spam' 'spam' 'spam' 'spam' 'spam' 'spam'
 'spam' 'spam' 'spam' 'spam' 'spam' 'spam' 'spam' 'spam' 'spam']

True labels:
 ['spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'ham', 'spam', 'spam', 'spam', 'spam', 'spam', 'ham', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'ham', 'spam', 'spam', 'ham', 'spam', 'spam', 'ham', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam', 'spam']


##### Evaluating the results

In [226]:
from sklearn.metrics import accuracy_score

print('Accuracy: {:.3f}'.format(accuracy_score(y_test, y_pred)))

Accuracy: 0.988
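
Accuracy is simply the fraction of predictions that match the true labels. A minimal standard-library sketch makes the arithmetic explicit; `accuracy` is a hypothetical helper, not scikit-learn's function, and the demo labels are invented. With 50 test emails the score can only move in steps of 1/50 = 0.02, so a printed value such as 0.988 most likely comes from a larger test set, for example the 2,000 emails used later in this notebook.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where prediction and true label agree."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    hits = sum(t == p for t, p in zip(y_true, y_pred))
    return hits / len(y_true)

# Invented labels for 50 emails
y_true_demo = ["spam"] * 45 + ["ham"] * 5
y_pred_demo = ["spam"] * 46 + ["ham"] * 4   # one ham email misclassified as spam

print(f"Accuracy: {accuracy(y_true_demo, y_pred_demo):.3f}")
```

One error out of 50 gives 0.980, and no errors give 1.000; nothing in between is attainable at this sample size.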


## 6.- Increasing the Dataset Size

In [224]:
# Read 12,000 emails
X, y = create_prep_dataset("/Users/marco/Desktop/GitRepositorios/datasets/trec07p/full/index", 12000)

Parsing email: 12000

In [212]:
# Use 10,000 emails to train the algorithm and 2,000 for testing
X_train, y_train = X[:10000], y[:10000]
X_test, y_test = X[10000:], y[10000:]

In [214]:
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(X_train)

In [216]:
clf = LogisticRegression()
clf.fit(X_train, y_train)

In [218]:
X_test = vectorizer.transform(X_test)

In [220]:
y_pred = clf.predict(X_test)

In [222]:
print('Accuracy: {:.3f}'.format(accuracy_score(y_test, y_pred)))

Accuracy: 0.988
