# Importing The Libraries

In [1]:
# Import the required libraries
import os
import pandas as pd
import numpy as np 
import re
from email import message_from_string
from tqdm import tqdm
import warnings
warnings.filterwarnings('ignore')
from IPython.display import display

# Importing the dataset

In [2]:
# folder where everything is stored
root = r"C:\Users\tengt\Downloads\archive"

# folders for each type of email
folders = {
    "spam": os.path.join(root, r"spam_2\spam_2"),
    "easy_ham": os.path.join(root, r"easy_ham\easy_ham"),
    "hard_ham": os.path.join(root, r"hard_ham\hard_ham"),
}

data = []

# go through each folder
for label, folder in folders.items():

    # go through each file in the folder
    for file in os.listdir(folder):

        # full path to the file
        path = os.path.join(folder, file)

        # make sure it's a file
        if os.path.isfile(path): 
            
            # read the file
            with open(path, "r", encoding="latin-1", errors="ignore") as f:
                text = f.read()

            # spam = 1, ham = 0
            data.append({
                "label": 1 if label == "spam" else 0,
                "message": text
            })

# put into a dataframe
dataset = pd.DataFrame(data)

# show how many emails in each class
print("Total emails:", len(dataset), "\n")

# print the label distribution
print(dataset['label'].value_counts())

Total emails: 4198 

label
0    2801
1    1397
Name: count, dtype: int64


Note:
- `0` → ham (normal / not spam)
- `1` → spam (could be ads, scams, phishing, malware, etc.)

In [3]:
# Display the first few rows of the dataframe
display(dataset.head())

Unnamed: 0,label,message
0,1,From ilug-admin@linux.ie Tue Aug 6 11:51:02 ...
1,1,From lmrn@mailexcite.com Mon Jun 24 17:03:24 ...
2,1,From amknight@mailexcite.com Mon Jun 24 17:03...
3,1,From jordan23@mailexcite.com Mon Jun 24 17:04...
4,1,From merchantsworld2001@juno.com Tue Aug 6 1...


In [4]:
# Save the DataFrame to a CSV file named 'spamAssassin.csv'
dataset.to_csv('spamAssassin.csv', index=False) # The index=False ensures the index is not saved

# Data Exploration

Exploring the dataset helps us better understand its structure and characteristics.

This dataset contains a collection of **ham (legitimate) and spam emails** made available by the **SpamAssassin project**. The dataset is widely used for email filtering research and benchmarking. It includes plain-text emails without attachments, and the messages are organized into separate folders for spam and ham.

In total, the dataset contains **4,198 emails**, consisting of both spam and ham messages.

For this section, we will follow these steps:

1. Access a sample email from the dataset (first, middle, and last)  
2. Generate descriptive statistics  
3. Handle missing/null values  
4. Check for duplicate rows  
5. Check for empty emails  
6. Check for emails containing non-ASCII characters  

### Accessing Sample Emails from the Dataset (First, Middle, and Last)

The dataset contains 4198 rows (indexed 0 to 4197).  
We will examine the first, middle, and last emails to inspect their structure and determine the cleaning steps required.
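
As a minimal sketch, the three sample positions can be derived from the row count instead of being typed by hand. The helper names `sample_positions` and `preview` and the toy DataFrame are illustrative assumptions, not part of the original notebook; the middle position is taken as `(n - 1) // 2`, which gives 2098 for 4198 rows.

```python
import pandas as pd

def sample_positions(n_rows):
    """Return the first, middle, and last positional indices for n_rows rows."""
    if n_rows <= 0:
        raise ValueError("n_rows must be positive")
    return [0, (n_rows - 1) // 2, n_rows - 1]

def preview(df, column="message", max_chars=300):
    """Yield (position, truncated text) for the first, middle, and last rows."""
    for pos in sample_positions(len(df)):
        # iloc is positional, so this works even if the index is not 0..n-1
        yield pos, str(df[column].iloc[pos])[:max_chars]

# Toy data standing in for the real `dataset` (hypothetical values)
toy = pd.DataFrame({"message": [f"email {i}" for i in range(5)]})
for pos, text in preview(toy):
    print(pos, text)
```

Computing the positions this way keeps the comments and the indices in agreement if the number of rows changes.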

In [109]:
# Accessing the content of the first email at index 0
print(dataset["message"][0])

From ilug-admin@linux.ie  Tue Aug  6 11:51:02 2002
Return-Path: <ilug-admin@linux.ie>
Delivered-To: yyyy@localhost.netnoteinc.com
Received: from localhost (localhost [127.0.0.1])
	by phobos.labs.netnoteinc.com (Postfix) with ESMTP id 9E1F5441DD
	for <jm@localhost>; Tue,  6 Aug 2002 06:48:09 -0400 (EDT)
Received: from phobos [127.0.0.1]
	by localhost with IMAP (fetchmail-5.9.0)
	for jm@localhost (single-drop); Tue, 06 Aug 2002 11:48:09 +0100 (IST)
Received: from lugh.tuatha.org (root@lugh.tuatha.org [194.125.145.45]) by
    dogma.slashnull.org (8.11.6/8.11.6) with ESMTP id g72LqWv13294 for
    <jm-ilug@jmason.org>; Fri, 2 Aug 2002 22:52:32 +0100
Received: from lugh (root@localhost [127.0.0.1]) by lugh.tuatha.org
    (8.9.3/8.9.3) with ESMTP id WAA31224; Fri, 2 Aug 2002 22:50:17 +0100
Received: from bettyjagessar.com (w142.z064000057.nyc-ny.dsl.cnc.net
    [64.0.57.142]) by lugh.tuatha.org (8.9.3/8.9.3) with ESMTP id WAA31201 for
    <ilug@linux.ie>; Fri, 2 Aug 2002 22:50:11 +0100
    [6

In [110]:
# Accessing the content of the middle email at index 2098
print(dataset["message"][2098])

From fork-admin@xent.com  Thu Sep 19 13:14:49 2002
Return-Path: <fork-admin@xent.com>
Delivered-To: yyyy@localhost.example.com
Received: from localhost (jalapeno [127.0.0.1])
	by jmason.org (Postfix) with ESMTP id E6F3016F03
	for <jm@localhost>; Thu, 19 Sep 2002 13:14:47 +0100 (IST)
Received: from jalapeno [127.0.0.1]
	by localhost with IMAP (fetchmail-5.9.0)
	for jm@localhost (single-drop); Thu, 19 Sep 2002 13:14:47 +0100 (IST)
Received: from xent.com ([64.161.22.236]) by dogma.slashnull.org
    (8.11.6/8.11.6) with ESMTP id g8JC7hC18737 for <jm@jmason.org>;
    Thu, 19 Sep 2002 13:07:43 +0100
Received: from lair.xent.com (localhost [127.0.0.1]) by xent.com (Postfix)
    with ESMTP id 2E69B2940FC; Thu, 19 Sep 2002 05:04:06 -0700 (PDT)
Delivered-To: fork@example.com
Received: from sunserver.permafrost.net (u172n16.hfx.eastlink.ca
    [24.222.172.16]) by xent.com (Postfix) with ESMTP id 4BE5029409E for
    <fork@xent.com>; Thu, 19 Sep 2002 05:03:15 -0700 (PDT)
Received: from [192.168.12

In [111]:
# Accessing the content of the last email at index 4197
print(dataset["message"][4197])

Return-Path: <test-admin@lists.sourceforge.net>
Received: from usw-sf-list2.sourceforge.net (usw-sf-fw2.sourceforge.net
	[216.136.171.252]) by home.sewingwitch.com (8.11.6/8.11.6) with ESMTP id
	g9208B729827 for <shiva+qpopper-webdev@sewingwitch.com>; Tue, 1 Oct 2002
	17:08:11 -0700
Received: from usw-sf-list1-b.sourceforge.net ([10.3.1.13]
	helo=usw-sf-list1.sourceforge.net) by usw-sf-list2.sourceforge.net with
	esmtp (Exim 3.31-VA-mm2 #1 (Debian)) id 17wX3l-0004o7-00 for
	<shiva+qpopper-webdev@sewingwitch.com>; Tue, 01 Oct 2002 17:08:13 -0700
Date: Tue, 01 Oct 2002 17:08:10 -0700
Subject: (SPAM? 08.00) lists.sourceforge.net mailing list memberships reminder
From: mailman-owner@lists.sourceforge.net
To: shiva+qpopper-webdev@sewingwitch.com
X-No-Archive: yes
X-Ack: no
Sender: test-admin@lists.sourceforge.net
Errors-To: test-admin@lists.sourceforge.net
X-BeenThere: test@lists.sourceforge.net
X-Mailman-Version: 2.0.9-sf.net
Precedence: bulk
Message-Id: <E17wX3l-0004o7-00@usw-sf-list2.sou

From inspecting the first, middle, and last emails, we can see the general structure and content of the dataset.  

Key observations include:
- Emails contain extensive headers and metadata, which are not needed for text analysis.
- There is inconsistent formatting, including line breaks, tabs, and spaces, which will need cleaning.
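- The three sampled emails appear to use only standard ASCII characters, but three samples cannot rule out encoding issues, so we check the full dataset for non-ASCII characters later in this section.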

These insights help us identify potential issues and guide the next steps in cleaning and parsing the dataset. Before proceeding, we will continue with data exploration to gain a better understanding of the dataset.

### Descriptive Statistics 

In [112]:
# Make a copy to prevent mutation
data_ds = dataset.copy()

# Descriptive statistics
print(data_ds.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4198 entries, 0 to 4197
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   label    4198 non-null   int64 
 1   message  4198 non-null   object
dtypes: int64(1), object(1)
memory usage: 65.7+ KB
None


### Handling Missing Values

In [113]:
# Check for missing values in the dataframe
print(data_ds.isna().sum().sort_values())

label      0
message    0
dtype: int64


### Check for Duplicate Rows

In [114]:
# Shape of data_ds before removing duplicates
print(f"Shape before removing duplicates: {data_ds.shape}")

# Removing duplicate rows
data_ds = data_ds.drop_duplicates(subset=["message"]).reset_index(drop=True)

# Shape of data_ds after removing duplicates
print(f"Shape after removing duplicates: {data_ds.shape}")

Shape before removing duplicates: (4198, 2)
Shape after removing duplicates: (4178, 2)


Based on the dataset summary from `info()`, all 4198 original emails have non-null values in the `message` column, so there are no missing entries. However, this does not guarantee that every email contains meaningful content, because a message could still be an empty string. Therefore, we next check for emails with empty content.

In addition, 20 duplicate emails (based on the `message` column) were removed, resulting in a cleaner dataset.
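
Before dropping duplicates, it can help to count them and to confirm that no duplicated message appears under both labels. The following is a small sketch on a toy DataFrame (the data and variable names are assumptions for illustration); it uses only `DataFrame.duplicated`, `drop_duplicates`, and `groupby(...).nunique()`.

```python
import pandas as pd

# Toy data standing in for `data_ds` (hypothetical values)
toy = pd.DataFrame({
    "label": [1, 0, 1, 0],
    "message": ["buy now", "hello", "buy now", "meeting at 3"],
})

# True for every repeat of a message after its first occurrence
dup_mask = toy.duplicated(subset=["message"], keep="first")
n_duplicates = int(dup_mask.sum())

# Same operation as in the notebook: keep the first copy, renumber the index
deduped = toy.drop_duplicates(subset=["message"]).reset_index(drop=True)

# Messages that occur with more than one distinct label would be suspicious
labels_per_message = toy.groupby("message")["label"].nunique()
conflicting = labels_per_message[labels_per_message > 1]

print(f"Duplicates: {n_duplicates}, rows kept: {len(deduped)}, "
      f"conflicting labels: {len(conflicting)}")
```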

### Check for empty emails

In [115]:
# Check for completely empty emails (exact empty string; whitespace is not stripped before comparing)
empty_rows = data_ds[data_ds['message'] == ""]
print(f"Number of completely empty emails: {empty_rows.shape[0]}")

Number of completely empty emails: 0
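
The comparison above only matches the exact empty string. A slightly stricter variant, sketched here under the assumption that the column holds no missing values (as verified earlier), also counts messages that consist only of whitespace. The function name `count_blank` is hypothetical.

```python
import pandas as pd

def count_blank(series):
    """Count entries that are empty or contain only whitespace."""
    stripped = series.astype(str).str.strip()
    return int((stripped == "").sum())

# Toy data (hypothetical): one empty string and one whitespace-only string
toy = pd.Series(["hello", "", "  \n\t", "x"])
print(f"Blank entries: {count_blank(toy)}")
```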


### Check for Emails Containing Non-ASCII Characters

In [116]:
# Function to check if a text contains any non-ASCII characters
def non_ascii_check(text):
    """
    Check if a string contains any non-ASCII characters.
    ASCII range = 0–127
    """
    # Ensure the input is a string
    text = str(text)

    # Loop through each character in the text
    for char in text:
        # ord(char) gives the Unicode code point
        if ord(char) > 127:  
            # Found a non-ASCII character
            return True

    # If we finish the loop, all characters are ASCII
    return False

# Apply to the 'message' column
non_ascii_rows = dataset[dataset['message'].apply(non_ascii_check)]
print(f"Number of emails with non-ASCII characters: {non_ascii_rows.shape[0]}")

Number of emails with non-ASCII characters: 294


Based on the data exploration, the SpamAssassin dataset contains no null values in the `message` column and no completely empty emails, but it does contain 20 duplicate rows. After removing them, the dataset consists of 4178 unique emails. We also found 294 emails containing non-ASCII characters; note that this count was computed on the original 4198-row `dataset`, before deduplication. Since we are building a phishing email detection system, we have decided **not to remove these emails** and will handle them appropriately during system development. Keeping them is useful, because such emails may reveal unusual or suspicious patterns while also covering legitimate non-English messages.  
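
For reference, the character-by-character loop can be written more compactly with the built-in `str.isascii()` method (available since Python 3.7). This is a sketch on toy data, and the name `has_non_ascii` is an assumption; applying it to the deduplicated `data_ds` instead of `dataset` would give a count that matches the 4178 remaining emails.

```python
import pandas as pd

def has_non_ascii(text):
    """Return True if the text contains any character outside ASCII (0-127)."""
    return not str(text).isascii()

# Toy data (hypothetical): two entries contain accented characters
toy = pd.Series(["plain text", "caf\u00e9", "na\u00efve", ""])
mask = toy.apply(has_non_ascii)
print(f"Entries with non-ASCII characters: {int(mask.sum())}")
```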

Next, we proceed to clean the dataset to prepare the emails for parsing and analysis.

# Data Cleaning

In this section, we will clean the dataset using several methods:

1. Email Parsing  
2. Text Cleaning  
3. Post-Parsing Data Checks  

**Email parsing** involves extracting the meaningful content from each email, such as the body text, while removing unnecessary components like headers, metadata, or special formatting. This step is essential to prepare the emails for further cleaning, analysis, or natural language processing tasks.

### Email Parsing
Email parsing is essential to extract structured information from raw emails.  
We will split this process into three main sections:

1. **Header extraction:** Important fields such as `Message-ID`, `Date`, `From`, `To`, and `Subject` will be extracted from the email headers.  
2. **Message body extraction:** The main content of the email will be isolated for further analysis, including text cleaning and phishing detection.  
3. **URL extraction:** Links are crucial for identifying suspicious or malicious content.  


In [117]:
# transform the email into correct format
message = dataset.loc[0]['message']
e = message_from_string(message)

e.items()

[('Return-Path', '<ilug-admin@linux.ie>'),
 ('Delivered-To', 'yyyy@localhost.netnoteinc.com'),
 ('Received',
  'from localhost (localhost [127.0.0.1])\n\tby phobos.labs.netnoteinc.com (Postfix) with ESMTP id 9E1F5441DD\n\tfor <jm@localhost>; Tue,  6 Aug 2002 06:48:09 -0400 (EDT)'),
 ('Received',
  'from phobos [127.0.0.1]\n\tby localhost with IMAP (fetchmail-5.9.0)\n\tfor jm@localhost (single-drop); Tue, 06 Aug 2002 11:48:09 +0100 (IST)'),
 ('Received',
  'from lugh.tuatha.org (root@lugh.tuatha.org [194.125.145.45]) by\n    dogma.slashnull.org (8.11.6/8.11.6) with ESMTP id g72LqWv13294 for\n    <jm-ilug@jmason.org>; Fri, 2 Aug 2002 22:52:32 +0100'),
 ('Received',
  'from lugh (root@localhost [127.0.0.1]) by lugh.tuatha.org\n    (8.9.3/8.9.3) with ESMTP id WAA31224; Fri, 2 Aug 2002 22:50:17 +0100'),
 ('Received',
  'from bettyjagessar.com (w142.z064000057.nyc-ny.dsl.cnc.net\n    [64.0.57.142]) by lugh.tuatha.org (8.9.3/8.9.3) with ESMTP id WAA31201 for\n    <ilug@linux.ie>; Fri, 2 Aug 2

After examining a sample email, we found that the headers contain valuable information, including `Message-ID`, `Date`, `From`, `To`, `Subject`, and other relevant metadata such as `Sender`, `Return-Path`, and mailing list information (`List-Id`).  

To support further analysis and enable rule-based phishing detection, we extracted these fields from all emails and organized them into a structured DataFrame.

In [118]:
# Function to parse email and extract specified fields
def parse_email(raw_msg, fields=None):
    """
    Parse a raw email string and extract specified header fields.

    Parameters
    ----------
    raw_msg : str
        The raw email content as a string.
    fields : list[str], optional
        List of email fields to extract (e.g., ["From", "Subject"]).
        If None, a default set of common fields is used.

    Returns
    -------
    dict
        Dictionary mapping cleaned field names (lowercase, underscores)
        to their extracted values. Missing fields return None.
    """

    # Extract fields from a raw email string
    if fields is None:
        fields = ["Message-ID", "Date", "From", "To", "Subject", "Sender", "List-Id"] # Standard fields to extract if none provided

    try:
        email_obj = message_from_string(raw_msg)
        result = {}

        for field in fields:
            # make field names easier to use in df (lowercase, underscores)
            key = field.lower().replace("-", "_") # X-to -> x_to
            result[key] = email_obj.get(field) # Extract field value or None if missing
        return result
    
    except Exception as e:
        # if parsing fails, just fill with None
        return {field.lower().replace("-", "_"): None for field in fields}


def build_email_dataframe(df, message_col="message", fields=None):
    """
    Parse a DataFrame column of raw email messages into structured fields.

    Parameters
    ----------
    df : pandas.DataFrame
        DataFrame containing raw email messages.
    message_col : str, default "message"
        Name of the column in df that holds the raw email strings.
    fields : list[str], optional
        List of email fields to extract. If None, the default from parse_email is used.

    Returns
    -------
    pandas.DataFrame
        DataFrame where each row corresponds to an email and each column
        corresponds to a cleaned header field (e.g., from, subject, x_to).
    """

    # Parse emails in a DataFrame column into structured fields
    parsed_rows = []
    
    # Loop through each raw email in the DataFrame, show a progress bar while parsing,
    # and store the extracted fields as dictionaries in parsed_rows
    for msg in tqdm(df[message_col], total=len(df), desc="Parsing emails"):
        parsed_rows.append(parse_email(msg, fields))
    
    # Form a DataFrame from the list of parsed email dictionaries
    parsed_df = pd.DataFrame(parsed_rows, index=df.index)  # keep same index

    # Return the df with original and parsed fields
    return pd.concat([df, parsed_df], axis=1)  # merge with original

# extract specified fields from all emails in the dataset
extracted_df = build_email_dataframe(data_ds, message_col="message")

Parsing emails: 100%|██████████| 4178/4178 [00:00<00:00, 7575.24it/s]


#### Message Body Extraction

In [119]:
# Function to extract the body of each email
def body(messages):
    # Create an empty list to store email bodies
    column = []

    # Loop through each raw email message with a progress bar
    for message in tqdm(messages, total=len(messages), desc="Extracting email bodies"):
        # Parse the raw email string into an email object
        e = message_from_string(message)

        # Extract the body (payload) of the email
        column.append(e.get_payload())

    # Return the list of all extracted bodies
    return column

# Add a new column 'body' to the DataFrame by extracting the email body
extracted_df['body'] = body(data_ds['message'])

Extracting email bodies: 100%|██████████| 4178/4178 [00:00<00:00, 6601.97it/s]
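
One caveat with `get_payload()`: for multipart messages it returns a list of sub-messages rather than a string. The sketch below shows one possible way to collect only the text parts using `Message.walk()`. The function name `extract_text_body` is an assumption, and the sketch does not decode base64 or quoted-printable content.

```python
from email import message_from_string

def extract_text_body(raw_msg):
    """Return the text payload of an email, joining text parts of multipart mail."""
    msg = message_from_string(raw_msg)

    # Simple case: a single-part message whose payload is already a string
    if not msg.is_multipart():
        payload = msg.get_payload()
        return payload if isinstance(payload, str) else ""

    # Multipart case: walk all sub-parts and keep the text/* leaves
    parts = []
    for part in msg.walk():
        if part.is_multipart():
            continue  # container part, its children are visited separately
        if part.get_content_maintype() == "text":
            payload = part.get_payload()
            if isinstance(payload, str):
                parts.append(payload)
    return "\n".join(parts)

# Hypothetical multipart message used only for illustration
raw = (
    "From: a@example.com\n"
    "Subject: hi\n"
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/alternative; boundary="XYZ"\n'
    "\n"
    "--XYZ\n"
    "Content-Type: text/plain\n"
    "\n"
    "Hello plain\n"
    "--XYZ\n"
    "Content-Type: text/html\n"
    "\n"
    "<p>Hello html</p>\n"
    "--XYZ--\n"
)
body_text = extract_text_body(raw)
print(body_text)
```

Whether to keep both the plain-text and HTML alternatives, or only one of them, is a design choice that depends on the later cleaning steps.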


In [120]:
# Display the first few rows of the new dataframe with extracted fields
display(extracted_df.head())

Unnamed: 0,label,message,message_id,date,from,to,subject,sender,list_id,body
0,1,From ilug-admin@linux.ie Tue Aug 6 11:51:02 ...,<1028311679.886@0.57.142>,"Fri, 02 Aug 2002 23:37:59 0530","""Start Now"" <startnow2002@hotmail.com>",ilug@linux.ie,[ILUG] STOP THE MLM INSANITY,ilug-admin@linux.ie,Irish Linux Users' Group <ilug.linux.ie>,Greetings!\n\nYou are receiving this letter be...
1,1,From lmrn@mailexcite.com Mon Jun 24 17:03:24 ...,<B0000178595@203.129.205.5.205.129.203.in-addr...,"Mon, 28 Jul 1980 14:01:35",lmrn@mailexcite.com,ranmoore@cybertime.net,"Real Protection, Stun Guns! Free Shipping! Ti...",,,"<html>\n<body>\n<center>\n<h3>\n<font color=""b..."
2,1,From amknight@mailexcite.com Mon Jun 24 17:03...,<0845b5355070f52WEBCUST2@webcust2.hightowertec...,"Wed, 30 Jul 1980 18:25:49",amknight@mailexcite.com,cbmark@cbmark.com,"New Improved Fat Burners, Now With TV Fat Abso...",,,"<html>\n<body>\n<center>\n<b>\n<font color=""bl..."
3,1,From jordan23@mailexcite.com Mon Jun 24 17:04...,<0925c5750200f52WEBCUST2@webcust2.hightowertec...,"Thu, 31 Jul 1980 07:20:54",jordan23@mailexcite.com,ranmoore@swbell.net,"New Improved Fat Burners, Now With TV Fat Abso...",,,"<html>\n<body>\n<center>\n<b>\n<font color=""bl..."
4,1,From merchantsworld2001@juno.com Tue Aug 6 1...,<200208040037.BAA09623@webnote.net>,"Sun, 19 Oct 1980 10:55:16",yyyy@pluriproj.pt,yyyy@pluriproj.pt,"Never Repay Cash Grants, $500 - $50,000, Secre...",,,"<html><xbody>\n<hr width = ""100%"">\n<center><h..."


### Validation of Body and Header Extraction  

After extracting the body of each email, it is important to validate the results. Formatting issues that we are not yet aware of may have caused some emails to be parsed incorrectly, leading to:  

1. **Incomplete or incorrect body extraction**  
   - In certain cases, parts of the email headers may still remain inside the `body` field instead of being fully separated.  
   - This requires manual or programmatic checks to confirm that the `body` column truly contains only the message content.  

2. **Null or missing values in other header fields**  
   - Some header fields such as `to`, `from`, or `subject` may appear as null after parsing.  
   - These values may still exist within the raw email text but were not properly extracted during parsing.  

To address this, we will:  
- Inspect a sample of emails to verify that the `body` field contains the actual message rather than residual headers.  
- Cross-check the raw `message` text for cases where header fields (e.g., `to`) are null, and attempt to recover these values if possible.  

This step ensures that the dataset is **accurately structured** before proceeding to further cleaning and analysis.  
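
The two checks described above can be sketched as follows. The helper names, the `HEADER_LINE` pattern, and the toy DataFrame are assumptions made for illustration; the idea is to count nulls per header column and to list the rows whose body still contains a line that looks like a header.

```python
import re
import pandas as pd

# Lines starting with one of these names followed by a colon look like headers
HEADER_LINE = re.compile(
    r"^(from|to|subject|date|message-id|sender|list-id)\s*:", re.IGNORECASE
)

def null_header_counts(df, header_cols):
    """Count missing values for each header column that exists in df."""
    present = [c for c in header_cols if c in df.columns]
    return df[present].isna().sum()

def bodies_with_header_lines(df, body_col="body"):
    """Return index labels of rows whose body contains a header-like line."""
    def looks_leaky(body):
        return any(HEADER_LINE.match(line.strip()) for line in str(body).split("\n"))
    return df.index[df[body_col].apply(looks_leaky)].tolist()

# Toy data (hypothetical): row 1 has a "To:" header that leaked into the body
toy = pd.DataFrame({
    "to": ["a@example.com", None, None],
    "subject": ["hi", "offer", None],
    "body": ["hello there", "To: b@example.com\n\nbuy now", "Dated: today\nnothing here"],
})
counts = null_header_counts(toy, ["to", "subject", "sender"])
leaky = bodies_with_header_lines(toy)
print(counts)
print("Rows with header-like lines in body:", leaky)
```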

In [121]:
# Random Sample email 1
random_index = np.random.randint(0, len(extracted_df))
print(f"Random Sample Email at index {random_index}:\n")
print(extracted_df['body'][random_index])

Random Sample Email at index 3869:

URL: http://www.newsisfree.com/click/-2,8655708,215/
Date: 2002-10-08T03:30:58+01:00

*Politics: *The Conservative leadership yesterday launched itself into a frenzy 
of self-reproach as it struggled to shed the image of Britain's "nasty party".





In [122]:
# Random Sample email 2
random_index = np.random.randint(0, len(extracted_df))
print(f"Random Sample Email at index {random_index}:\n")
print(extracted_df['body'][random_index])

Random Sample Email at index 2646:

Just got this ... I was just reading mail, but in a very dark
room, where the keyboard is illuminated mostly by the light from
the (laptop) screen.   I think I put my fingers on the wrong keys.
(I mostly use the keyboard exclusively while running exmh).

This is from today's cvs (the fixes for the problems I mentioned
yesterday are included) - I eventually managed to contact the cvs
server.

expected integer but got ""
    while executing
"incr m"
    (procedure "MhSeqExpand" line 12)
    invoked from within
"MhSeqExpand $folder $msgids"
    (procedure "MhSeq" line 2)
    invoked from within
"MhSeq $folder $seq $how $oldmsgids $msgids"
    (procedure "Mh_SequenceUpdate" line 54)
    invoked from within     
"Mh_SequenceUpdate $folder replace $seq $msgids"
    (procedure "Seq_Set" line 4)
    invoked from within             
"Seq_Set $folder cur $msgid"
    (procedure "Mh_SetCur" line 7)      
    invoked from within                     
"Mh_SetCur $e

In [123]:
# Random Sample email 3
random_index = np.random.randint(0, len(extracted_df))
print(f"Random Sample Email at index {random_index}:\n")
print(extracted_df['body'][random_index])

Random Sample Email at index 209:


<html>
<head>
   <title>The Soft2Reg Team</title>
</head>
<link rel=STYLESHEET type=text/css href=css/main.css>
<STYLE type=text/css>
.Black11 {FONT-SIZE: 11px; COLOR: black; FONT-FAMILY: Verdana;  FONT-WEIGHT: normal; TEXT-DECORATION: none}
.Red13 {font-size : 13px; font-family :Verdana, Arial, Helvetica, sans-serif;font-weight : bold;color : Red}
</style>

<body bgcolor=ffffff>
<center>
<table border=0 width=600 cellpadding=2 cellspacing=0>
	<tr>
		<td width=275>
			<a href=http://www.soft2reg.com>
				<img src=http://www.soft2reg.com/images/logo.gif border=0 alt=www.soft2reg ></a>
		</td>
		<td align=right width=100%>
			<hr size=0>
		</td>
	</tr>
</table>
<p>
<table border=0 width=600 cellpadding=4 cellspacing=0 bgcolor=6699cc>
	<tr>
		<td>
			<font face=arial size=+1 color=000000>
				<b>Soft2reg.com -- Service Update</b></font>
		</td>
	</tr>
</table>
<p>
<table border=0 width=600 cellpadding=4 cellspacing=0>
	<tr>
		<td class=Black11>
			Hello

From the review of three sample emails, no major issues were identified. However, with over 4,000 emails in the dataset, manual inspection is not feasible. To ensure data quality, we will implement an automated validation process to:  

- Move any existing headers (if spotted) into their respective dataframe columns.  
- Extract the main body message.  

This process will produce a cleaned, standardized dataset that is ready for further analysis.

In [124]:
# Regex: normal headers that should not appear in the body
HEADER_RE = re.compile(
    r'^(return-path|delivered-to|message-id|date|from|to|subject|sender|errors-to|list-id)\s*:',
    re.IGNORECASE
)

# mapping header names -> df column names
HEADER_TO_COL = {
    "message-id": "message_id",
    "date": "date",
    "from": "from",
    "to": "to",
    "subject": "subject",
    "sender": "sender",
    "list-id": "list_id",
}

# Function to detect and extract body or  headers from email text
def extract_info(text):
    """
    Detect and Extract body from an email and recover any headers that leaked into it.
    Steps:
      1) Normalize line endings and split lines.
      2) Body starts after the first blank line (end of header block).
      3) Remove any header-like lines from the body.
      4) Return (clean_body, recovered_headers_dict).
    """
    if text is None:
        return "", {}

    # Normalize line endings 
    text = str(text).replace('\r\n', '\n').replace('\r', '\n')

    # Split into lines
    lines = text.split('\n')

    # find the first blank line in the email text
    header_end = -1  # default: assume no blank line (header/body split not found yet)
    
    # look for the first blank line
    for i, ln in enumerate(lines):
        if ln.strip() == "":
            header_end = i     # record the index of the blank line
            break              # stop at the first blank line

    # body_lines = everything after the header block
    if header_end >= 0:
        # found a blank line → body starts just after that line
        body_lines = lines[header_end + 1:]
    else:
        # no blank line found → treat the whole thing as body
        body_lines = lines

    # Initialize storage for recovered headers and cleaned body lines
    recovered = {}
    keep = []

    # process body lines
    for ln in body_lines:
        ls = ln.strip()

        # header-like line inside body
        m = HEADER_RE.match(ls)
        if m:
            hdr = m.group(1).lower()
            
            # ls is the current line stripped of whitespace
            if ":" in ls:
                # split into at most 2 parts: [before_colon, after_colon]
                parts = ls.split(":", 1)
                # take the right-hand side (after the first colon)
                val = parts[1].strip()
            else:
                # if somehow there is no colon, fallback to empty string
                val = ""

            recovered[hdr] = val
            continue    # skip this line (do not keep it in body)
        
        # keep everything else
        keep.append(ln)

    # return cleaned body and any recovered headers
    clean_body = "\n".join(keep).strip()
    return clean_body, recovered

# Helper to check if a cell is empty (NA or whitespace)
def is_empty(cell):
    """True if cell is NA or only whitespace."""

    # check for NA
    if pd.isna(cell):
        return True
    
    # check for empty or whitespace-only string
    if isinstance(cell, str) and cell.strip() == "":
        return True
    return False

# Function to clean email bodies and backfill missing headers into DataFrame
def apply_data(df, col_body="body"):
    """
    - Clean df[col_body] so it keeps ONLY the body (header-like lines are removed).
    - Write recovered headers into matching columns if those columns exist.
    - Return a dict with counts of how many cells were filled per header column.
    """
    # Initialize storage for cleaned bodies and fill counts 
    clean_bodies = []
    # One counter per DataFrame column (only header columns can ever be incremented)
    fill_counts = {col: 0 for col in df.columns}

    # process each email body
    for i, msg in enumerate(df[col_body]):
        # extract clean body and any recovered headers
        body_clean, recovered = extract_info(msg)
        clean_bodies.append(body_clean)

        # backfill headers only into existing columns in your df
        for hdr, col in HEADER_TO_COL.items():
            if col not in df.columns:
                continue
            val = recovered.get(hdr)

            if val is None:
                continue

            # only fill if the cell is empty
            if is_empty(df.at[i, col]):

                # if the recovered value is a list, join with newlines
                if isinstance(val, list):
                    df.at[i, col] = "\n---\n".join(val)

                # otherwise just fill the string value
                else:
                    df.at[i, col] = val
                fill_counts[col] += 1

    # update the body column with cleaned bodies
    df[col_body] = clean_bodies

    return fill_counts

# clean the body and backfill any missing headers and also count how many cells were filled per column
fill_counts = apply_data(extracted_df, col_body="body")

# print how many cells were filled per column
for col, count in fill_counts.items():
    print(f"Filled \"{col}\" in {count} rows.")

Filled "label" in 0 rows.
Filled "message" in 0 rows.
Filled "message_id" in 0 rows.
Filled "date" in 0 rows.
Filled "from" in 0 rows.
Filled "to" in 4 rows.
Filled "subject" in 1 rows.
Filled "sender" in 1 rows.
Filled "list_id" in 1 rows.
Filled "body" in 0 rows.


#### URL Extraction

Next, we extract all URLs contained in the email bodies.  

URLs are important for phishing detection because suspicious or malicious links are often key indicators of phishing attempts.  

By isolating the URLs, we can analyze them separately and apply rules to identify potentially harmful links.

In [None]:
# Function to extract URLs from dataset['message'] directly
def extract_urls_from_message(raw_msg):
    
    # Ensure the input is a string
    if not isinstance(raw_msg, str):
        return None
    
    # Regex pattern to match URLs (http or https)
    url_pattern = r'(https?://[^\s,)\]>]+)'

    # Find all URLs in the raw message
    urls = re.findall(url_pattern, raw_msg)

    return urls if urls else None

# apply directly on the raw message column
extracted_df['urls'] = data_ds['message'].apply(extract_urls_from_message)
extracted_df['num_urls'] = extracted_df['urls'].apply(lambda x: len(x) if x is not None else 0)

In [126]:
# Display the first few rows to verify extraction
display(extracted_df.head())

Unnamed: 0,label,message,message_id,date,from,to,subject,sender,list_id,body,urls,num_urls
0,1,From ilug-admin@linux.ie Tue Aug 6 11:51:02 ...,<1028311679.886@0.57.142>,"Fri, 02 Aug 2002 23:37:59 0530","""Start Now"" <startnow2002@hotmail.com>",ilug@linux.ie,[ILUG] STOP THE MLM INSANITY,ilug-admin@linux.ie,Irish Linux Users' Group <ilug.linux.ie>,You are receiving this letter because you have...,[http://www.linux.ie/mailman/listinfo/ilug],1
1,1,From lmrn@mailexcite.com Mon Jun 24 17:03:24 ...,<B0000178595@203.129.205.5.205.129.203.in-addr...,"Mon, 28 Jul 1980 14:01:35",lmrn@mailexcite.com,ranmoore@cybertime.net,"Real Protection, Stun Guns! Free Shipping! Ti...",,,</b>\n</font>\n</h3>\n</center>\n<p>\nIT'S GET...,[http://www.geocities.com/realprotection_20022...,4
2,1,From amknight@mailexcite.com Mon Jun 24 17:03...,<0845b5355070f52WEBCUST2@webcust2.hightowertec...,"Wed, 30 Jul 1980 18:25:49",amknight@mailexcite.com,cbmark@cbmark.com,"New Improved Fat Burners, Now With TV Fat Abso...",,,"<font color=""blue"">\n\nLOSE 30 POUNDS IN 30 D...",[http://www.geocities.com/ultra_weightloss_200...,4
3,1,From jordan23@mailexcite.com Mon Jun 24 17:04...,<0925c5750200f52WEBCUST2@webcust2.hightowertec...,"Thu, 31 Jul 1980 07:20:54",jordan23@mailexcite.com,ranmoore@swbell.net,"New Improved Fat Burners, Now With TV Fat Abso...",,,"<font color=""blue"">\n\nLOSE 30 POUNDS IN 30 D...",[http://www.geocities.com/ultra_weightloss_200...,4
4,1,From merchantsworld2001@juno.com Tue Aug 6 1...,<200208040037.BAA09623@webnote.net>,"Sun, 19 Oct 1980 10:55:16",yyyy@pluriproj.pt,yyyy@pluriproj.pt,"Never Repay Cash Grants, $500 - $50,000, Secre...",,,"<li>Every day <b><font color = ""green"">million...","[http://www.geocities.com/grantzone_2002/"", ht...",2


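Based on the extracted dataset, the key headers, the email body, and the URLs have been captured in separate columns, and `message_id` preserves its original format. Each of the five sample rows contains at least one URL. Two details in this output are worth noting. First, the `body` values now start later than in the earlier display (for example, the opening "Greetings!" line of the first email is gone), because `extract_info` treats everything before the first blank line as header text. Second, the bodies still contain HTML markup, which is handled in the text cleaning step that follows.  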
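
In the last sample row above, one extracted URL ends with a stray `"` character because the pattern stops only at whitespace and a few bracket characters. The sketch below is one possible refinement, not the notebook's method: it also stops at quote characters, trims trailing punctuation, and pulls out the host name with `urllib.parse.urlparse` so that links can be analysed by domain. The names `URL_RE`, `extract_urls`, and `url_hosts` are assumptions.

```python
import re
from urllib.parse import urlparse

# Stop at whitespace, common closing brackets, commas, and quote characters
URL_RE = re.compile(r"https?://[^\s,)\]>'\"]+")
TRAILING = ".;:!?"  # punctuation that often follows a URL in running text

def extract_urls(text):
    """Return a list of http(s) URLs found in text (empty list if none)."""
    if not isinstance(text, str):
        return []
    return [u.rstrip(TRAILING) for u in URL_RE.findall(text)]

def url_hosts(urls):
    """Return the host name of each URL."""
    return [urlparse(u).hostname for u in urls]

# Hypothetical sample text for illustration
sample = 'Visit <a href="http://www.example.com/offer/">here</a> or https://example.org/page.'
urls = extract_urls(sample)
hosts = url_hosts(urls)
print(urls)
print(hosts)
```

Returning an empty list instead of `None` is a design choice here: it lets `len()` be applied directly when counting URLs per email.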

### Text Cleaning

In this step, we clean the relevant text fields in the dataset to prepare for analysis and phishing detection.  

The cleaning process includes:

1. **Removing content inside angle brackets (`<...>`)** for all columns except `message_id`  
   - Standardizes email addresses and header fields.  

2. **Normalizing whitespace and removing unwanted content**  
   - Replace multiple spaces, tabs, and newlines with a single space.  
   - Remove leading and trailing spaces.  
   - Remove separator lines such as `----------` and `************`.  
   - Remove embedded HTML code.  

3. **Reordering and dropping columns**  
   - Adjust column order to match the workflow for phishing detection.  
   - Drop any unnecessary or redundant columns to simplify the dataset.  
   - This makes the dataset more organized and easier to work with in subsequent steps.  

This process ensures all text fields are **clean, consistent, and ready** for further processing, while preserving important information for phishing detection, including URLs, attachments, and non-ASCII characters.

In [127]:
# Make a copy to prevent mutation
final_df = extracted_df.copy()

# Function to clean text fields for phishing detection analysis
def clean_text(x, keep_tags=False):
    """
    Clean a single text field for phishing detection analysis.

    Steps (in the order applied):
    1. Remove <...> spans, including HTML tags, unless keep_tags=True
       (used to preserve message_id).
    2. Replace separator runs of 4+ '-' or '*' characters with a space.
    3. Remove [ and ] but keep the content inside.
    4. Remove quote characters ' and ".
    5. Collapse whitespace and strip leading/trailing spaces.
    """
    if x is None:
        return None  # No change

    text = str(x)

    # Remove <...> unless we want to keep tags (e.g., message_id)
    if not keep_tags:
        text = re.sub(r'<[^>]*>', '', text)  

    # Remove separator lines
    text = re.sub(r'[-*]{4,}', ' ', text)  # sequences of 4+ - or *

    # Remove brackets [] but keep content inside
    text = re.sub(r'[\[\]]+', '', text)

    # Remove quotes ' and "
    text = re.sub(r"[\'\"]+", '', text)

    # Collapse multiple spaces, tabs, newlines into a single space, and strip
    text = re.sub(r'\s+', ' ', text).strip()

    return text

# Fallback pattern for finding email addresses in From / To / Sender fields
EMAIL_RE = re.compile(r'[A-Z0-9._%+\-]+@[A-Z0-9.\-]+\.[A-Z]{2,}', re.IGNORECASE)

# Normalize email address fields
def normalize_address_field(x):
    """
    Return only the email(s). If <...> present, prefer those.
    If multiple addresses, join with ', '.
    """
    if x is None:
        return None
    s = str(x)

    # Prefer emails inside <...>
    in_angles = re.findall(r'<\s*([^<>@\s]+@[^<>@\s]+)\s*>', s)
    if in_angles:
        return ', '.join(e.strip() for e in in_angles)

    # Fallback: any email-looking substrings
    any_emails = EMAIL_RE.findall(s)
    if any_emails:
        return ', '.join(e.strip() for e in any_emails)

    # If nothing matched, return cleaned plain text
    return clean_text(s)

# Apply cleaning to relevant columns
columns_to_clean = ['message_id', 'date', 'from', 'to', 'subject', 'sender', 'list_id', 'body', 'urls']

for col in columns_to_clean:
    if col in ('from', 'to', 'sender'):
        final_df[col] = final_df[col].apply(normalize_address_field)
    elif col == 'message_id':
        final_df[col] = final_df[col].apply(lambda x: clean_text(x, keep_tags=True))
    else:
        final_df[col] = final_df[col].apply(clean_text)


In [128]:
# Specify the desired column order
cols = ['message_id', 'date', 'from', 'to', 'subject', 'sender', 'list_id', 'body', 'urls', 'num_urls', 'label']

# Select the columns in the desired order; any column not listed here is dropped
final_df = final_df[cols]
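
Selecting with a list of names both reorders and drops columns in one step, and raises a `KeyError` if a listed column is absent. The sketch below, on a toy frame with invented values, shows one way to check for missing names first and to report what gets dropped. The names `toy`, `wanted`, `missing`, and `dropped` are illustrative assumptions.

```python
import pandas as pd

# Toy frame standing in for the extracted dataset (values are invented)
toy = pd.DataFrame({
    "label": [1],
    "message": ["raw email text"],
    "body": ["parsed body"],
    "num_urls": [0],
})
wanted = ["body", "num_urls", "label"]

# Fail early with a readable message if a requested column does not exist
missing = [c for c in wanted if c not in toy.columns]
if missing:
    raise KeyError(f"Columns not found: {missing}")

# Columns that the selection will silently drop
dropped = [c for c in toy.columns if c not in wanted]
reordered = toy[wanted]

print("Dropped:", dropped)
print("Order:", list(reordered.columns))
```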

In [129]:
# display the cleaned dataframe
display(final_df.head())

Unnamed: 0,message_id,date,from,to,subject,sender,list_id,body,urls,num_urls,label
0,<1028311679.886@0.57.142>,"Fri, 02 Aug 2002 23:37:59 0530",startnow2002@hotmail.com,ilug@linux.ie,ILUG STOP THE MLM INSANITY,ilug-admin@linux.ie,Irish Linux Users Group,You are receiving this letter because you have...,http://www.linux.ie/mailman/listinfo/ilug,1,1
1,<B0000178595@203.129.205.5.205.129.203.in-addr...,"Mon, 28 Jul 1980 14:01:35",lmrn@mailexcite.com,ranmoore@cybertime.net,"Real Protection, Stun Guns! Free Shipping! Tim...",,,"ITS GETTING TO BE SPRING AGAIN, PROTECT YOURSE...",http://www.geocities.com/realprotection_200220...,4,1
2,<0845b5355070f52WEBCUST2@webcust2.hightowertec...,"Wed, 30 Jul 1980 18:25:49",amknight@mailexcite.com,cbmark@cbmark.com,"New Improved Fat Burners, Now With TV Fat Abso...",,,LOSE 30 POUNDS IN 30 DAYS... GUARANTEED!!! All...,http://www.geocities.com/ultra_weightloss_2002...,4,1
3,<0925c5750200f52WEBCUST2@webcust2.hightowertec...,"Thu, 31 Jul 1980 07:20:54",jordan23@mailexcite.com,ranmoore@swbell.net,"New Improved Fat Burners, Now With TV Fat Abso...",,,LOSE 30 POUNDS IN 30 DAYS... GUARANTEED!!! All...,http://www.geocities.com/ultra_weightloss_2002...,4,1
4,<200208040037.BAA09623@webnote.net>,"Sun, 19 Oct 1980 10:55:16",yyyy@pluriproj.pt,yyyy@pluriproj.pt,"Never Repay Cash Grants, $500 - $50,000, Secre...",,,Every day millions of dollars are given away t...,"http://www.geocities.com/grantzone_2002/, http...",2,1


### Post-Parsing Data Validation

After parsing and splitting the emails into separate columns, it is important to verify the integrity of the new dataset.  

We will:

1. **Inspect dataset summary** – Using `final_df.info()` to review column names, data types, and non-null counts.  
2. **Check for null values** – Some fields such as `subject` or `body` may be empty even if the original message was not null.  
3. **Check for duplicate rows** – Parsing may create redundant entries that should be removed.  

These steps ensure that the parsed dataset is **clean, consistent, and ready** for further analysis and phishing detection.

#### Dataset Summary

In [130]:
# Check the info of cleaned dataframe
final_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4178 entries, 0 to 4177
Data columns (total 11 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   message_id  4176 non-null   object
 1   date        4177 non-null   object
 2   from        4177 non-null   object
 3   to          4012 non-null   object
 4   subject     4176 non-null   object
 5   sender      1978 non-null   object
 6   list_id     1723 non-null   object
 7   body        4178 non-null   object
 8   urls        3767 non-null   object
 9   num_urls    4178 non-null   int64 
 10  label       4178 non-null   int64 
dtypes: int64(2), object(9)
memory usage: 359.2+ KB


##### Checking for Null Values

In [131]:
# Check for missing values in the dataframe
print(final_df.isna().sum().sort_values())

body             0
num_urls         0
label            0
date             1
from             1
message_id       2
subject          2
to             166
urls           411
sender        2200
list_id       2455
dtype: int64
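
Note that `isna()` counts only missing values (`None` / `NaN`). A field that was parsed as an empty or whitespace-only string is not counted, so a `body` with 0 nulls can still contain blank entries. The following is a minimal sketch of counting blanks alongside nulls, using a toy frame with invented values; the names `toy`, `blank_counts`, and `summary` are assumptions for illustration.

```python
import pandas as pd

# Toy frame with one null and several blank strings (values are invented)
toy = pd.DataFrame({
    "subject": ["Hello", "", None, "   "],
    "body": ["text", "", "more text", "x"],
})

# Missing values as reported by isna()
null_counts = toy.isna().sum()

# A value counts as blank if it is a string that is empty after stripping whitespace
blank_counts = toy.apply(
    lambda col: col.map(lambda v: isinstance(v, str) and v.strip() == "").sum()
)

summary = pd.DataFrame({"null": null_counts, "blank": blank_counts})
print(summary)
```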


##### Checking for duplicates

In [132]:
# shape of dataset before removing duplicates
print(f"Shape before removing duplicates: {final_df.shape}")

# Removing duplicate rows
final_df = final_df.drop_duplicates().reset_index(drop=True)

# Shape of dataset after removing duplicates
print(f"Shape after removing duplicates: {final_df.shape}")

Shape before removing duplicates: (4178, 11)
Shape after removing duplicates: (4090, 11)
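
Called without arguments, `drop_duplicates()` removes only rows that are identical in every column. As a hedged sketch on toy data (column names follow the notebook, values are invented), the snippet below shows how duplicates could be inspected before dropping them, and how near-duplicates that share a `body` and `label` but differ in `message_id` could be found with the `subset` parameter.

```python
import pandas as pd

# Toy frame: rows 0 and 1 are exact copies, row 2 repeats the body under a new id
toy = pd.DataFrame({
    "message_id": ["<a@x>", "<a@x>", "<b@x>", "<c@x>"],
    "body": ["Buy now", "Buy now", "Buy now", "Meeting at 3"],
    "label": [1, 1, 1, 0],
})

# Rows identical in every column (what drop_duplicates() targets by default)
exact_dupes = toy[toy.duplicated(keep=False)]

# Rows sharing the same body and label, even if message_id differs
same_body = toy[toy.duplicated(subset=["body", "label"], keep=False)]

# Default behaviour: keep the first copy of each fully identical row
deduped = toy.drop_duplicates().reset_index(drop=True)

print(len(exact_dupes), len(same_body), len(deduped))
```

Whether near-duplicates should also be removed depends on the analysis; the notebook removes exact duplicates only.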


# Save the cleaned dataset
After removing duplicate rows, the cleaned dataset is saved to a CSV file for later processing.

In [None]:
# Save the DataFrame to a CSV file named 'cleaned_SA.csv'
final_df.to_csv('cleaned_SA.csv', index=False) # The index=False ensures the index is not saved
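
When the file is loaded again, `pd.read_csv` by default parses empty fields as `NaN`, so empty strings and missing values become indistinguishable after a round trip. The sketch below illustrates this with a toy frame and an in-memory buffer instead of a file on disk; the frame contents and variable names are assumptions for illustration.

```python
import io
import pandas as pd

# Toy frame with one empty string and one missing value (values are invented)
toy = pd.DataFrame({
    "subject": ["Hello", ""],
    "sender": [None, "list-admin@example.com"],
    "label": [1, 0],
})

# Write to an in-memory buffer in the same way as to_csv('cleaned_SA.csv', index=False)
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Default reload: every empty field becomes NaN
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# With keep_default_na=False, empty fields come back as empty strings instead
buffer.seek(0)
reloaded_keep = pd.read_csv(buffer, keep_default_na=False)

print("Nulls before saving:", int(toy.isna().sum().sum()))
print("Nulls after default reload:", int(reloaded.isna().sum().sum()))
print("Nulls with keep_default_na=False:", int(reloaded_keep.isna().sum().sum()))
```

Either way the distinction between "empty" and "missing" is lost in the CSV itself, so null counts taken after reloading may differ from those shown earlier in this notebook.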