Predicting Spam with Capital Letter Frequency


## Question: Can the credibility of mail be categorized based on the use of capital letters and keywords in its message content?

### Introduction:

Spam mail, unsolicited messages sent en masse to random recipients, has grown in volume in recent years, aided by the convenience and anonymity of the internet [Kaddoura et al., 2022]. Ranging from harmless business advertisements to scams that take advantage of vulnerable groups, spam imposes varying degrees of economic and emotional harm on a vast number of e-mail users [Kaddoura et al., 2022]. Attention-grabbing words and an overuse of capital letters are common features of spam mail [Mujtaba et al., 2017], and mail-filtering programs have long been proposed as a way to reduce user interaction with spam [Kaddoura et al., 2022]. Since many of these programs use lexical and term-based features to identify spam [Mujtaba et al., 2017], the reliability of these identifiers is important to the accuracy of spam detectors.

We will investigate whether the legitimacy of an e-mail can be categorized based on its use of capital letters and keywords, using the UCI Machine Learning Repository's Spambase dataset, which is composed of e-mails flagged as spam by recipients as well as personal and work-related non-spam e-mails. The dataset provides the length of the longest run of capital letters in each message (the `capital_run_length_longest` column) as well as the percentage of the message made up of certain words and characters, such as "free" and "!". We will analyze the accuracy of classifying spam based on the frequencies of the strings "000", "free" and "credit", together with the longest run of capital letters in a message. Our exploratory data analysis showed a large difference in the mean values of these attributes between spam and non-spam e-mails, so they are likely to be good identifiers.
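To make these attributes concrete, here is a minimal sketch of how a word frequency and the longest capital run could be computed from a raw message. The function names and the example message are our own, and case-insensitive word matching is an assumption; the dataset itself ships with these values already computed.

```python
import re

def word_freq(message: str, word: str) -> float:
    """Percentage of the words in `message` that equal `word`."""
    # Treat every run of alphanumeric characters as one word.
    words = re.findall(r"[A-Za-z0-9]+", message)
    if not words:
        return 0.0
    # Assumption: matching ignores case.
    matches = sum(1 for w in words if w.lower() == word.lower())
    return 100 * matches / len(words)

def capital_run_length_longest(message: str) -> int:
    """Length of the longest uninterrupted run of capital letters."""
    runs = re.findall(r"[A-Z]+", message)
    return max((len(run) for run in runs), default=0)

# Made-up message for illustration only.
example = "FREE credit check!!! Claim your free $1,000 prize NOW"
print(word_freq(example, "free"))           # 2 of 10 words -> 20.0
print(word_freq(example, "000"))            # 1 of 10 words -> 10.0
print(capital_run_length_longest(example))  # "FREE" -> 4
```

Note that "$1,000" splits into the words "1" and "000", which is why "000" can act as a keyword in messages that mention large sums of money.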




### Preliminary exploratory data analysis

In [1]:
import pandas as pd
from collections import ChainMap
from sklearn.model_selection import train_test_split

In [2]:
spam = pd.read_csv("spambase.data", header = None)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,48,49,50,51,52,53,54,55,56,57
0,0.0,0.64,0.64,0.0,0.32,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.778,0.0,0.0,3.756,61,278,1
1,0.21,0.28,0.5,0.0,0.14,0.28,0.21,0.07,0.0,0.94,...,0.0,0.132,0.0,0.372,0.18,0.048,5.114,101,1028,1
2,0.06,0.0,0.71,0.0,1.23,0.19,0.19,0.12,0.64,0.25,...,0.01,0.143,0.0,0.276,0.184,0.01,9.821,485,2259,1
3,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.137,0.0,0.137,0.0,0.0,3.537,40,191,1
4,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.135,0.0,0.135,0.0,0.0,3.537,40,191,1


In [3]:
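# Skip the descriptive header of spambase.names; the remaining lines have the form "name: type".
spam_titles = pd.read_csv("spambase.names", skiprows = 31)
# Keep only the text before the colon, i.e. the attribute name.
spam_titles_split = spam_titles["1"].str.split(":", expand = True)
# Build a {column position: attribute name} mapping to pass to rename().
spam_headers = spam_titles_split[[0]].to_dict()
spam_headers = dict(ChainMap(*spam_headers.values()))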

spam = spam.rename(columns = spam_headers).rename(columns = {57: "is_spam"})

Unnamed: 0,word_freq_make,word_freq_address,word_freq_all,word_freq_3d,word_freq_our,word_freq_over,word_freq_remove,word_freq_internet,word_freq_order,word_freq_mail,...,char_freq_;,char_freq_(,char_freq_[,char_freq_!,char_freq_$,char_freq_#,capital_run_length_average,capital_run_length_longest,capital_run_length_total,is_spam
0,0.0,0.64,0.64,0.0,0.32,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.778,0.0,0.0,3.756,61,278,1
1,0.21,0.28,0.5,0.0,0.14,0.28,0.21,0.07,0.0,0.94,...,0.0,0.132,0.0,0.372,0.18,0.048,5.114,101,1028,1
2,0.06,0.0,0.71,0.0,1.23,0.19,0.19,0.12,0.64,0.25,...,0.01,0.143,0.0,0.276,0.184,0.01,9.821,485,2259,1
3,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.137,0.0,0.137,0.0,0.0,3.537,40,191,1
4,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.135,0.0,0.135,0.0,0.0,3.537,40,191,1


In [4]:
spam_train, spam_test = train_test_split(spam, train_size = 0.75, stratify = spam["is_spam"])

#### Identifying Predictor Variables


In [7]:
# Split the training set by class.
spam_mail = spam_train[spam_train["is_spam"] == 1]
non_spam = spam_train[spam_train["is_spam"] == 0]
# Mean of every attribute within each class.
mean_spam = pd.DataFrame(spam_mail.mean().reset_index()).rename(columns = {"index": "attribute", 0: "mean_value_spam"})
mean_non_spam = pd.DataFrame(non_spam.mean().reset_index()).rename(columns = {"index": "attribute", 0: "mean_value_non_spam"})

# A large spam / non-spam ratio marks an attribute that is far more common in spam.
mean_val_compare = mean_spam.merge(mean_non_spam, on = "attribute")
mean_val_compare = mean_val_compare.assign(spam_non_spam_ratio = mean_val_compare["mean_value_spam"] / mean_val_compare["mean_value_non_spam"])
mean_val_compare = mean_val_compare.sort_values(by="spam_non_spam_ratio", ascending=False)
mean_val_compare.head(20)

Unnamed: 0,attribute,mean_value_spam,mean_value_non_spam,spam_non_spam_ratio
57,is_spam,1.0,0.0,inf
3,word_freq_3d,0.11362,0.001114,101.965694
22,word_freq_000,0.247785,0.006198,39.978296
6,word_freq_remove,0.276586,0.006944,39.83063
19,word_freq_credit,0.214253,0.007327,29.243035
52,char_freq_$,0.171753,0.010373,16.558395
23,word_freq_money,0.212428,0.014701,14.449821
14,word_freq_addresses,0.110007,0.009187,11.974252
15,word_freq_free,0.523687,0.079637,6.575958
16,word_freq_business,0.290338,0.048001,6.048598
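The same comparison can be written more compactly with `groupby`. The sketch below uses a tiny made-up data frame rather than the real training set. Because `is_spam` becomes the grouping key, it no longer appears as an attribute, so the `inf` row seen above does not arise.

```python
import pandas as pd

# Made-up values for illustration only; not taken from Spambase.
toy = pd.DataFrame({
    "word_freq_free": [0.9, 0.3, 0.0, 0.2],
    "capital_run_length_longest": [120, 80, 10, 30],
    "is_spam": [1, 1, 0, 0],
})

# One row per class (0 = non-spam, 1 = spam), one column per attribute.
class_means = toy.groupby("is_spam").mean()

# Ratio of the spam mean to the non-spam mean, largest first.
ratio = (class_means.loc[1] / class_means.loc[0]).sort_values(ascending=False)
print(ratio)
```

An attribute whose non-spam mean is exactly zero would still produce an infinite ratio, so such attributes would need separate handling.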


We reduce the data frames to only the columns and attributes we need:

In [8]:
spam_train_predictors = spam_train[["word_freq_000", "word_freq_credit", "word_freq_free", "capital_run_length_longest", "is_spam"]]
mean_val_predictors = mean_val_compare[mean_val_compare["attribute"].isin(["word_freq_000", "word_freq_credit", "word_freq_free", "capital_run_length_longest"])]

Unnamed: 0,attribute,mean_value_spam,mean_value_non_spam,spam_non_spam_ratio
22,word_freq_000,0.247785,0.006198,39.978296
19,word_freq_credit,0.214253,0.007327,29.243035
15,word_freq_free,0.523687,0.079637,6.575958
55,capital_run_length_longest,99.010302,18.534194,5.342034


#### Visualization of Predictor Variables
Using scatter plots, we visualize how the predictor variables we chose differ between spam and non-spam mail.

We will plot the word frequency of "000" against that of "credit", and the word frequency of "free" against the longest capital run length. In each plot, spam and non-spam points are distinguished by colour and shape:

In [10]:
import altair as alt
alt.data_transformers.disable_max_rows()

DataTransformerRegistry.enable('default')

In [33]:
zeros_and_credit_scatter_plot = (
    alt.Chart(
        spam_train_predictors, title="Word Frequency of '000' vs. 'credit'"
    ).mark_point(opacity=0.5)
    .encode(
        x=alt.X(
            "word_freq_credit",
            title="Frequency of 'credit'",
        ),
        y=alt.Y(
            "word_freq_000",
            title="Frequency of '000'",
        ),
        # is_spam is stored as 0/1, so mark it nominal (:N) to get discrete colours and shapes.
        color=alt.Color("is_spam:N", legend=alt.Legend(title="Spam by colour and shape", orient="left"), scale=alt.Scale(scheme='dark2')),
        shape="is_spam:N"
    )
)

free_and_capital_scatter_plot = (
    alt.Chart(
        spam_train_predictors, title="Word Frequency of 'free' vs. Longest Capital Run Length"
    ).mark_point(opacity=0.5)
    .encode(
        x=alt.X(
            "word_freq_free",
            title="Frequency of 'free'",
        ),
        y=alt.Y(
            "capital_run_length_longest",
            title="Longest Capital Run Length",
        ),
        color=alt.Color("is_spam:N", legend=alt.Legend(title="Spam by colour and shape", orient="right"), scale=alt.Scale(scheme='dark2')),
        shape="is_spam:N"
    )
)

# Show the two scatter plots side by side.
predictor_scatter_plots = alt.hconcat(zeros_and_credit_scatter_plot, free_and_capital_scatter_plot)

predictor_scatter_plots



alt.HConcatChart(...)

We will also use a bar plot to show, for each predictor variable, the ratio between its mean value in spam and its mean value in non-spam mail:

In [32]:
ratio_bar_plot = (
    alt.Chart(
        mean_val_predictors, title="Ratio of Spam to Non-Spam Mean Values of Predictor Variables"
    ).mark_bar(color='orange')
    .encode(
        x=alt.X(
            "attribute",
            title="Predictor Variable",
            sort='-y'
        ),
        y=alt.Y(
            "spam_non_spam_ratio",
            title="Mean Value Ratio (Spam / Non-Spam)",
        ),
    )
)

ratio_bar_plot

### Methods of analysis
We will build and test a classification model, then use the results to assess how well certain attributes of an e-mail message can be used to classify it as spam or non-spam.
We chose four predictor variables that we thought were indicative of whether an e-mail is spam: the word frequencies of "000", "credit" and "free", and the longest run of capital letters. We chose them because their mean values (taken over all e-mails in the training set) differ widely between spam and non-spam. We therefore believe that high values of these variables are characteristic of spam mail and will allow us to build an accurate prediction model using the `sklearn` package and the **K-nearest neighbours (KNN) algorithm**.
We will verify the accuracy of our model on a testing set that we put aside. To visualize our results, we will use line plots showing the accuracy of our model for varying values of `K`. We will also use scatter plots showing trends between spam and non-spam mail and our predictor variables, with spam and non-spam identified by colour and shape.
One plot will show this trend using our training set and the correct labels, and the other will use our testing set and the predicted labels.
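The planned workflow could look roughly like the sketch below. It is an outline under stated assumptions, not our final analysis: the data are synthetic stand-ins for the four Spambase columns, and the range of `K` values, the 5-fold cross-validation and the standardization step are our own choices. Standardizing is included because the longest capital run takes much larger values than the word frequencies and would otherwise dominate the distance calculation.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the Spambase columns (assumption, for illustration only).
rng = np.random.default_rng(0)
n = 400
is_spam = rng.integers(0, 2, size=n)
toy = pd.DataFrame({
    "word_freq_000": rng.exponential(0.05 + 0.25 * is_spam),
    "word_freq_credit": rng.exponential(0.05 + 0.20 * is_spam),
    "word_freq_free": rng.exponential(0.10 + 0.50 * is_spam),
    "capital_run_length_longest": rng.exponential(20 + 80 * is_spam),
    "is_spam": is_spam,
})
predictors = ["word_freq_000", "word_freq_credit", "word_freq_free",
              "capital_run_length_longest"]

# Same 75/25 stratified split as above, with a fixed seed for reproducibility.
train, test = train_test_split(
    toy, train_size=0.75, stratify=toy["is_spam"], random_state=0
)

# Standardize the predictors, then classify with K-nearest neighbours.
pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier())

# Estimate accuracy for each candidate K by cross-validation on the training set.
grid = GridSearchCV(
    pipeline,
    param_grid={"kneighborsclassifier__n_neighbors": list(range(1, 31, 2))},
    cv=5,
    scoring="accuracy",
)
grid.fit(train[predictors], train["is_spam"])

# Accuracy per K; this table is what a line plot of accuracy vs. K would use.
cv_results = pd.DataFrame(grid.cv_results_)[
    ["param_kneighborsclassifier__n_neighbors", "mean_test_score"]
]

# Evaluate the best K once on the held-out testing set.
best_k = grid.best_params_["kneighborsclassifier__n_neighbors"]
test_accuracy = grid.score(test[predictors], test["is_spam"])
print(best_k, round(test_accuracy, 3))
```

Choosing `K` by cross-validation on the training set keeps the testing set untouched until the final accuracy check.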

Works Cited:

    Kaddoura, S., Chandrasekaran, G., Elena Popescu, D., & Duraisamy, J. H. (2022). A systematic literature review on spam content detection and classification. PeerJ Computer Science, 8, e830. https://doi.org/10.7717/peerj-cs.830

    Mujtaba, G., Shuib, L., Raj, R. G., Majeed, N., & Al-Garadi, M. A. (2017). Email Classification Research Trends: Review and Open Issues. IEEE Access, 5, 9044–9064. https://doi.org/10.1109/access.2017.2702187