# Spam Classifier Project: My Journey in NLP

This notebook is a personal challenge: I want to build a spam classifier from a dataset of raw e-mail texts. The first challenge is to find a way to preprocess the data and turn the dataset into a DataFrame.

## Step 1. Preprocessing the data

The first step is to gather the data and load it into a DataFrame.

I used this dataset from Kaggle: https://www.kaggle.com/datasets/veleon/ham-and-spam-dataset

To preprocess the data, I followed this workflow:

1. List the files in the spam and ham mail directories.
2. Use the `email` library to parse each file into a message object.
3. Define a function that uses `BeautifulSoup` to extract the data in the right format and fills a dictionary containing:
    * Subjects
    * Texts
    * Labels
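
As a rough illustration of the parsing step and of the record collected for each e-mail, here is a minimal sketch on a hand-written message. The message bytes and the names `raw_message`, `message`, and `record` are assumptions made for this example; they do not come from the dataset, and the HTML handling with `BeautifulSoup` is left out to keep the sketch dependency-free.

```python
import email.parser
import email.policy

# Hypothetical raw message, written by hand for this example (not from the dataset).
raw_message = (
    b"Subject: Win a prize\r\n"
    b"Content-Type: text/plain; charset=utf-8\r\n"
    b"\r\n"
    b"Click here to claim your prize!\r\n"
)

# Parse the bytes into a message object with the modern e-mail policy.
message = email.parser.BytesParser(policy=email.policy.default).parsebytes(raw_message)

# Collect the three fields stored for each e-mail.
record = {
    "subject": str(message["Subject"]),
    "content": message.get_content().strip(),
    "label": 1,  # 1 = spam, 0 = ham
}
print(record)
```

With real files, `parse(f)` on an open binary file plays the role that `parsebytes` plays here.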

In [1]:
# List the e-mail files in the spam and ham directories.
import os
base_directory = "../input/ham-and-spam-dataset/"
spam_mail_dir = os.listdir(os.path.join(base_directory, "spam"))
ham_mail_dir = os.listdir(os.path.join(base_directory, "ham"))

In [2]:
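# Modules for reading and parsing e-mail files.
import email
import email.parser  # imported explicitly: email.parser.BytesParser is used below
import email.policy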

In [3]:
# Parse every spam and ham e-mail into a message object.
def load_mail(filename, is_spam = True):
    """
    Open the e-mail file with the given name and return it parsed as a message object.
    """
    base_path = os.path.join(base_directory, "spam" if is_spam else "ham")
    with open(os.path.join(base_path, filename), "rb") as f:
        return email.parser.BytesParser(policy = email.policy.default).parse(f)
    

    
spam = [load_mail(filename, is_spam = True) for filename in spam_mail_dir]
ham = [load_mail(filename, is_spam = False) for filename in ham_mail_dir]

In [4]:
print(f"Number of spam e-mails: {len(spam)}")
print(f"Number of ham e-mails: {len(ham)}")

Number of spam e-mails: 501
Number of ham e-mails: 2551


In [5]:
# We now process each mail and extract its content.
from bs4 import BeautifulSoup

# Define a function to get the content of emails.
def process_mail(emails, label, data_dict, default_topic = None):
    """
    Process emails: fill a dictionary with the subject, content and label of every e-mail.
        emails: list of parsed e-mail messages
        label: int: 1 = spam, 0 = ham
        data_dict: dictionary with 'subject', 'content', 'label' as keys, each mapping to a list
        default_topic = None: subject to use for a mail (or mail part) that has none.
    """
    for mail in emails:
        payload = mail.get_payload()
        if isinstance(payload, list):
            # Multipart mail: process each part, falling back to the parent's subject.
            process_mail(payload, label, data_dict, default_topic = mail["Subject"])
        else:
            if "Content-Type" in mail.keys():
                if "html" in mail["Content-Type"].lower():
                    try:
                        # No parser is named, so BeautifulSoup picks the best one installed.
                        soup = BeautifulSoup(mail.get_content())
                        topic = mail["Subject"]
                        if topic is None:
                            topic = default_topic
                        content = soup.body.text
                        data_dict["subject"].append(topic)
                        data_dict["content"].append(content)
                        data_dict["label"].append(label)
                    except Exception:
                        # Skip parts whose content cannot be extracted
                        # (e.g. decoding errors or a missing <body>).
                        pass
                elif "plain" in mail["Content-Type"].lower():
                    try:
                        topic = mail["Subject"]
                        if topic is None:
                            topic = default_topic
                        content = mail.get_content()
                        data_dict["subject"].append(topic)
                        data_dict["content"].append(content)
                        data_dict["label"].append(label)
                    except Exception:
                        # Skip parts whose content cannot be decoded.
                        pass
            else:
                # No Content-Type header: skip this mail.
                pass
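
The recursive call on list payloads matters because a multipart e-mail stores its text in sub-parts, and those parts usually carry no `Subject` header of their own. The sketch below illustrates this with a message built by hand using the standard library's `EmailMessage`; the message and the variable names are assumptions made for this example, not data from the dataset.

```python
from email.message import EmailMessage

# Hypothetical multipart message built by hand (not from the dataset).
message = EmailMessage()
message["Subject"] = "Newsletter"
message.set_content("Plain-text version")
message.add_alternative("<html><body><p>HTML version</p></body></html>", subtype="html")

# For a multipart message, get_payload() returns a list of parts.
payload = message.get_payload()
print(isinstance(payload, list))

# Each part has its own content type but no subject of its own.
part_types = [part.get_content_type() for part in payload]
part_subjects = [part["Subject"] for part in payload]
print(part_types)
print(part_subjects)
```

Since each part is processed separately, one multipart e-mail can add more than one row, and passing the parent's subject as `default_topic` keeps those rows from ending up without a subject.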

In [36]:
emails_dict = {"subject": [],
               "content": [],
               "label": []}

process_mail(spam, 1, emails_dict)
process_mail(ham, 0, emails_dict)

In [37]:
import pandas as pd
emails_df = pd.DataFrame(emails_dict)

In [40]:
emails_df.head()

| | subject | content | label |
|---|---|---|---|
| 0 | Teach and Grow Rich | \r\n Do You Want To Teach and... | 1 |
| 1 | A marketplace where lenders compete for your b... | \n\n\n\nCopyright 2002 - All rights reservedIf... | 1 |
| 2 | Adv: Mortgage Quotes Fast Online, No Cost | \n\nIf this promotion has reached you in error... | 1 |
| 3 | $10 a hour for watching e-mmercials! No joke! | \n \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nUnlist Info... | 1 |
| 4 | Re: Your VIP Pass | ###################################\n\n FREE ... | 1 |


In [41]:
emails_df.tail()

| | subject | content | label |
|---|---|---|---|
| 2663 | Re: Xine dependencies | Once upon a time, QuaffA wrote :\n\n> I've tri... | 0 |
| 2664 | New toys | URL: http://diveintomark.org/archives/2002/09/... | 0 |
| 2665 | Neutrino and X-ray physicists win Nobel Prize | URL: http://www.newsisfree.com/click/-2,867601... | 0 |
| 2666 | Re: Oh my... | \n----- Original Message ----- \nFrom: "Gregor... | 0 |
| 2667 | $15,000 umbrella stand: nothing exceeds like e... | URL: http://boingboing.net/#85541081\nDate: No... | 0 |


In [43]:
emails_df["label"].value_counts()

0    2166
1     502
Name: label, dtype: int64
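
These counts show that the two classes are imbalanced. As a small sketch, the share of each label can be computed from the counts printed above; the names `label_counts`, `total_rows`, and `label_shares` are assumptions made for this example.

```python
# Row counts per label, copied from the value_counts() output above.
label_counts = {0: 2166, 1: 502}  # 0 = ham, 1 = spam

total_rows = sum(label_counts.values())
label_shares = {label: count / total_rows for label, count in label_counts.items()}

for label, share in label_shares.items():
    print(f"label {label}: {share:.1%}")
```

On the DataFrame itself, `emails_df["label"].value_counts(normalize=True)` should give the same shares directly.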