## Predicting Spam with Text Features

## Question: Can the credibility of mail be categorized based on the use of capital letters and keywords in its message content?

### Introduction:

Spam mail, or unsolicited messages sent en masse to random recipients, has grown in volume in recent years, aided by the convenience and anonymity of the internet [Kaddoura et al., 2022]. Ranging from harmless business advertisements to scams that take advantage of vulnerable groups, spam imposes varying degrees of economic and emotional harm on a vast number of e-mail users [Kaddoura et al., 2022]. Attention-grabbing words and an overuse of capital letters are common features of spam mail [Mujtaba et al., 2017], and mail-filtering programs have long been proposed as a way to reduce user interaction with spam mail [Kaddoura et al., 2022]. Since many of these programs use lexical and term-based features to identify spam mail [Mujtaba et al., 2017], the reliability of these identifiers is important to the accuracy of spam mail detectors. We will investigate whether the legitimacy of mail can be categorized based on the use of capital letters and keywords, using the UCI Machine Learning Repository's Spambase dataset, which is composed of e-mails flagged as spam by recipients as well as personal and work-related non-spam e-mails. The dataset provides the length of the longest run of capital letters in each message (in the `capital_run_length_longest` column) as well as the percentage of the message made up of certain words and characters, such as "free" and "!". We will analyze how accurately spam mail can be categorized based on the percentage of the strings "000", "free" and "credit", as well as the longest run of capital letters in a message. Our exploratory data analysis showed a large difference in the mean value of these variables between spam and non-spam e-mails, so they are likely good identifiers.




### Preliminary exploratory data analysis

In [1]:
import pandas as pd
from collections import ChainMap
from sklearn.model_selection import train_test_split

In [2]:
import altair as alt
alt.data_transformers.disable_max_rows()
# !{sys.executable} -m pip uninstall rfc3986-validator

DataTransformerRegistry.enable('default')

In [3]:
import numpy as np
np.random.seed(1234)

In [4]:
spam = pd.read_csv("spambase.data", header = None)

In [5]:
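# Read the attribute descriptions, skipping the documentation preamble of spambase.names
spam_titles = pd.read_csv("spambase.names", skiprows = 31)
# Each remaining row looks like "name: type"; keep the part before the colon
spam_titles_split = spam_titles["1"].str.split(":", expand = True)
# Build a {column position: attribute name} mapping
spam_headers = spam_titles_split[[0]].to_dict()
spam_headers = dict(ChainMap(*spam_headers.values()))

# Apply the names; column 57 is the class label
spam = spam.rename(columns = spam_headers).rename(columns = {57: "is_spam"})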

In [6]:
spam_train, spam_test = train_test_split(spam, train_size = 0.75, stratify = spam["is_spam"])

#### Identifying Predictor Variables


In [7]:
# normalize data
# spam_train = (spam_train - spam_train.min()) / (spam_train.max() - spam_train.min())


spam_mail = spam_train[spam_train["is_spam"] == 1]
non_spam = spam_train[spam_train["is_spam"] == 0]
mean_spam = pd.DataFrame(spam_mail.mean().reset_index()).rename(columns = {"index": "variable", 0: "mean_value_spam"})
mean_non_spam = pd.DataFrame(non_spam.mean().reset_index()).rename(columns = {"index": "variable", 0: "mean_value_non_spam"})
mean_non_spam

mean_val_compare = mean_spam.merge(mean_non_spam, on = "variable")
mean_val_compare = mean_val_compare.assign(spam_non_spam_ratio = mean_val_compare["mean_value_spam"] / mean_val_compare["mean_value_non_spam"])
mean_val_compare = mean_val_compare.sort_values(by="spam_non_spam_ratio", ascending=False)
# drop 3d as it is invalid
mean_val_compare_plot_data = mean_val_compare[~mean_val_compare["variable"].isin(["word_freq_3d", "is_spam"])]

ratio_bar_plot_top_20 = (
    alt.Chart(
        mean_val_compare_plot_data.head(20), title="Ratios of Spam to Non-Spam Mean"
    ).mark_bar(color='orange')
    .encode(
        x=alt.X(
            "variable",
            title="Predictor Variable",
            sort='-y'
        ),
        y=alt.Y(
            "spam_non_spam_ratio",
            title="Mean Spam / Mean Non-Spam",
        ),
    )
)

ratio_bar_plot_top_20
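
The ratio table built in the cell above can also be derived with a single `groupby`. The following is a minimal sketch on a made-up frame (`toy` and its values are illustrative, not taken from Spambase). Note that a variable whose non-spam mean is zero would produce an infinite ratio, so such columns may need to be handled separately.

```python
import pandas as pd

# Made-up word frequencies for two non-spam (0) and two spam (1) messages.
toy = pd.DataFrame({
    "word_freq_free": [1.0, 3.0, 8.0, 8.0],
    "word_freq_credit": [1.0, 1.0, 2.0, 4.0],
    "is_spam": [0, 0, 1, 1],
})

# Mean of every predictor within each class: one row per value of is_spam.
class_means = toy.groupby("is_spam").mean()

# Spam mean divided by non-spam mean, largest ratio first.
ratios = (class_means.loc[1] / class_means.loc[0]).sort_values(ascending=False)
print(ratios)
```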



We also noticed that `capital_run_length_longest` contains an outlier (9989) that makes the data much harder to visualize, since all the other values are much smaller:

In [8]:
spam_train["capital_run_length_longest"].max()

9989

In [9]:
# set outlier to mean 
spam_train.loc[
    spam_train["capital_run_length_longest"] == spam_train["capital_run_length_longest"].max(), 
    "capital_run_length_longest"] = spam_train["capital_run_length_longest"].mean()

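In the cell above, the mean is computed while the outlier is still in the column, so the substituted value is pulled upward by the outlier itself. One possible alternative, sketched here on a small made-up series rather than the Spambase data, is to compute the mean from the remaining rows only. The helper name `replace_max_with_mean_of_rest` is hypothetical.

```python
import pandas as pd

def replace_max_with_mean_of_rest(values: pd.Series) -> pd.Series:
    """Return a copy where the maximum is replaced by the mean of all other values."""
    values = values.astype(float)      # the mean is generally not an integer
    is_max = values == values.max()    # flags every row tied for the maximum
    return values.mask(is_max, values[~is_max].mean())

toy_run_lengths = pd.Series([3, 5, 4, 9989])  # made-up values for illustration
cleaned = replace_max_with_mean_of_rest(toy_run_lengths)
print(cleaned.tolist())
```

This sketch assumes there is at least one value below the maximum; if every value were identical, the replacement would be `NaN`.
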
We reduce the dataframes to only the columns we will need:

In [10]:
spam_train_predictors = spam_train[["word_freq_000", "word_freq_credit", "word_freq_free", "capital_run_length_longest", "is_spam"]]
mean_val_predictors = mean_val_compare[mean_val_compare["variable"].isin(["word_freq_000", "word_freq_credit", "word_freq_free", "capital_run_length_longest"])]

spam_train_predictors_norm = (spam_train_predictors - spam_train_predictors.min()) / (spam_train_predictors.max() - spam_train_predictors.min())
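
The last line of the cell above applies min-max scaling, which maps each column to the range [0, 1] via `(x - min) / (max - min)`. When a testing set is scaled later, a common practice is to reuse the training minimum and maximum rather than recomputing them on the test data. A rough sketch with made-up frames follows; `toy_train`, `toy_test` and the helper `min_max_scale` are illustrative names, not part of the notebook.

```python
import pandas as pd

def min_max_scale(frame: pd.DataFrame, reference: pd.DataFrame) -> pd.DataFrame:
    """Scale `frame` column-wise using the min and max of `reference`."""
    col_min = reference.min()
    col_range = reference.max() - col_min
    return (frame - col_min) / col_range

# Made-up values for illustration only.
toy_train = pd.DataFrame({"word_freq_free": [0.0, 1.0, 2.0],
                          "capital_run_length_longest": [1, 11, 21]})
toy_test = pd.DataFrame({"word_freq_free": [0.5, 3.0],
                         "capital_run_length_longest": [6, 21]})

toy_train_norm = min_max_scale(toy_train, toy_train)  # training data scaled by itself
toy_test_norm = min_max_scale(toy_test, toy_train)    # test data scaled by training min/max
print(toy_test_norm)
```

Test values that exceed the training maximum end up above 1, which is expected with this approach.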

#### Visualization of Predictor Variables 
Using scatter plots, we visualize how the predictor variables we chose differ between spam and non-spam mail.

There are many data points densely packed together, so we will only sample a subset of them to avoid overplotting.

We will plot the word frequencies of "000" vs. "credit" as well as the word frequency of "free" vs. longest capital run length. In each plot, spam and non-spam points are distinguished by colour and shape:

In [11]:
# !{sys.executable} -m pip install check-jsonschema

In [12]:
zeros_and_credit_scatter_plot = (
    alt.Chart(
        spam_train_predictors.sample(int(len(spam_train_predictors) / 5)), title="Word Frequency of \"000\" vs. \'credit\'"
    ).mark_point(opacity=0.5)
    .encode(
        x=alt.X(
            "word_freq_credit",
            title="Frequency of \'credit\'",
        ),
        y=alt.Y(
            "word_freq_000",
            title="Frequency of \'000\'",
        ),
        color=alt.Color("is_spam:N", legend=alt.Legend(title=["Spam by colour and shape", "0 = non-spam", "1 = spam"], orient="left"), scale=alt.Scale(scheme='dark2')),
        shape="is_spam:N"
    )
)

zeros_and_credit_scatter_plot



In [13]:
free_and_capital_scatter_plot = (
    alt.Chart(
        spam_train_predictors.sample(int(len(spam_train_predictors) / 5)), title="Word Frequency of \'free\' vs. Capital Run Length"
    ).mark_point(opacity=0.5)
    .encode(
        x=alt.X(
            "word_freq_free",
            title="Frequency of \'free\'",
            scale=alt.Scale(0, 10)
        ),
        y=alt.Y(
            "capital_run_length_longest",
            title="Capital Run Length",
            scale=alt.Scale(0, 10)
        ),
        color=alt.Color("is_spam:N", legend=alt.Legend(title=["Spam by colour and shape","0 = non-spam", "1 = spam"], orient="right"), scale=alt.Scale(scheme='dark2')),
        shape="is_spam:N"
    )
)

free_and_capital_scatter_plot

#predictor_scatter_plots = alt.hconcat(zeros_and_credit_scatter_plot, free_and_capital_scatter_plot)
#predictor_scatter_plots

### Methods of analysis
We will build and test a classification model, then use the results to assess how well certain variables of an e-mail message can be used to correctly classify an e-mail as spam or non-spam.
We chose four predictor variables that we thought were indicative of whether an e-mail may be spam. We chose them because their mean values (taken over all the observed e-mails in the training set) showed large differences between spam and non-spam. Therefore, we believe high values for these variables are characteristic of spam mail and will allow us to build an accurate prediction model using the `sklearn` package and the **K-nearest neighbours (KNN) algorithm**.
We will also verify the accuracy of our model on a testing set that we put aside. To visualize our results, we will use line plots showing the accuracy of our model for varying values of `K`. We will also use multiple scatter plots showing trends between spam and non-spam mail and our predictor variables, where spam and non-spam are identified by colour and shape.
One plot will show this trend using our training set and correct labels, and the other will use our testing set and predicted labels.
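
The workflow described above could look roughly like the following sketch. It is not the final analysis: it runs on synthetic data from `make_classification` as a stand-in for the four Spambase predictors, and the candidate values of `K`, the 5-fold cross-validation, and the `MinMaxScaler` step are assumptions chosen for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for four predictors and a binary spam label (assumption).
X, y = make_classification(n_samples=400, n_features=4, n_informative=3,
                           n_redundant=1, random_state=1234)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.75, stratify=y, random_state=1234)

# Scaling lives inside the pipeline so each cross-validation fold
# is scaled using only its own training portion.
knn_pipeline = make_pipeline(MinMaxScaler(), KNeighborsClassifier())

# Candidate K values are an illustrative choice.
param_grid = {"kneighborsclassifier__n_neighbors": list(range(1, 31, 2))}
knn_search = GridSearchCV(knn_pipeline, param_grid, cv=5, scoring="accuracy")
knn_search.fit(X_train, y_train)

# One row per K with its mean cross-validated accuracy, ready for a line plot.
accuracy_by_k = pd.DataFrame({
    "k": [p["kneighborsclassifier__n_neighbors"]
          for p in knn_search.cv_results_["params"]],
    "mean_cv_accuracy": knn_search.cv_results_["mean_test_score"],
})

best_k = knn_search.best_params_["kneighborsclassifier__n_neighbors"]
test_accuracy = knn_search.score(X_test, y_test)  # accuracy of the refitted best model
print(best_k, round(test_accuracy, 3))
```

The `accuracy_by_k` frame can be passed to an Altair line chart with `k` on the x-axis and `mean_cv_accuracy` on the y-axis, and the testing set is only touched once, after `K` has been chosen.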

### Expected outcomes and significance

We expect to find that it is possible to identify malicious e-mails based on the use of capital letters and certain keywords in the message.
In our personal experience, untrustworthy e-mails tend to have grammatical errors and focus on monetary rewards, leading us to believe that this method will work.
These findings could have a substantial impact. Mail providers such as Gmail and Outlook could use such classification systems (and likely already do) to filter out spam messages, provided that the model is accurate enough. A future question this could lead to is: are there other grammatical devices that could predict the credibility of e-mails?

Works Cited:

    Kaddoura, S., Chandrasekaran, G., Elena Popescu, D., & Duraisamy, J. H. (2022). A systematic literature review on spam content detection and classification. PeerJ Computer Science, 8, e830. https://doi.org/10.7717/peerj-cs.830

    Mujtaba, G., Shuib, L., Raj, R. G., Majeed, N., & Al-Garadi, M. A. (2017). Email Classification Research Trends: Review and Open Issues. IEEE Access, 5, 9044–9064. https://doi.org/10.1109/access.2017.2702187