# Spam Detection
In this *notebook* we will explore the training of a *spam classifier*<sup>[1](#fnt1), [2](#fnt2)</sup> using Naive Bayes<sup>[3](#fnt3)</sup> and the UCI SMS Spam Collection data set.<sup>[4](#fnt4)</sup>

## Getting Started
We need to import the necessary modules and set up a few globals for the *data set*.

In [None]:
# get necessary libs
import os
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# setup a few globals
DATA_DIR_PATH = '/home/jovyan/data'
SMS_SPAM_DATASET_URL = 'https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip'
SMS_SPAM_DATASET_PATH = os.path.join(DATA_DIR_PATH, 'smsspamcollection.csv')

### Downloading UCI SMS Spam Data Set<sup>[4](#fnt4)</sup>
The first time the *notebook* runs, it will download the data set from *UCI*'s machine learning repository.

In [None]:
# now check if data already downloaded
if not os.path.exists(SMS_SPAM_DATASET_PATH):
    # get necessary libs for downloading
    import io
    import pathlib
    import requests
    import zipfile
    
    # create data dir if it does not exist
    pathlib.Path(DATA_DIR_PATH).mkdir(parents=True, exist_ok=True)
    
    # get data and fail early on an HTTP error
    resp = requests.get(SMS_SPAM_DATASET_URL)
    resp.raise_for_status()

    # extract the tab-separated data and save it locally
    with zipfile.ZipFile(io.BytesIO(resp.content)) as zf:
        with zf.open('SMSSpamCollection') as data:
            with open(SMS_SPAM_DATASET_PATH, 'wb') as csvf:
                csvf.write(data.read())
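
The extraction step above can be tried without any network access. The following is a minimal sketch that builds a tiny archive in memory and reads one member back out, the same way the download cell reads `SMSSpamCollection`. The archive content is a made-up row, not real data.

```python
import io
import zipfile

# Build a small in-memory zip archive holding one made-up row (not real data).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("SMSSpamCollection", "ham\tSee you at lunch\n")

# Read the member back from raw bytes, as the download cell does with resp.content.
with zipfile.ZipFile(io.BytesIO(buffer.getvalue())) as zf:
    member_names = zf.namelist()
    with zf.open("SMSSpamCollection") as member:
        payload = member.read()

print(member_names)
print(payload)
```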

### Loading SMS Spam Data Set
Because the data is stored as a *tab-separated* file, we use `pandas.read_table`, which splits on tabs by default, to load it. (`pandas.read_csv` with `sep='\t'` would work as well.)

In [None]:
# get data loaded from table
sms_spam_collection_data = pd.read_table(
    SMS_SPAM_DATASET_PATH,
    sep='\t',
    header=None,
    names=['label', 'sms_message']
)

# sanity check
sms_spam_collection_data.head()
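
To see how the `label<TAB>message` layout is parsed, here is a small sketch that loads two made-up rows from an in-memory string instead of the downloaded file. The rows and the `sample_*` names are illustrative assumptions only.

```python
import io
import pandas as pd

# Two made-up rows in the same label<TAB>message layout (not real data).
sample_tsv = "ham\tSee you at lunch\nspam\tWIN a free prize now\n"

# The file has no header row, so the column names are supplied explicitly.
sample_df = pd.read_csv(
    io.StringIO(sample_tsv),
    sep="\t",
    header=None,
    names=["label", "sms_message"],
)

# Count how many messages carry each label.
label_counts = sample_df["label"].value_counts()
print(sample_df)
print(label_counts)
```

Running `value_counts()` on the real `label` column is a quick way to check how imbalanced the ham and spam classes are before training.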

## Naive Bayes Classifier
Now we can begin *training* and *testing* a *Naive Bayes* classifier<sup>[3](#fnt3)</sup> on the *data set*.

### Test/Train Split
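First we need to build our *train*/*test* data sets from the *total* data set. To do this, *scikit-learn* provides the `train_test_split` utility.<sup>[5](#fnt5)</sup> By default it holds out 25% of the samples for testing, and passing `stratify=y` keeps the spam/ham ratio the same in both splits.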

In [None]:
# now prepare test/train data
X = sms_spam_collection_data["sms_message"].values
y = sms_spam_collection_data["label"].values

# generate train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify=y)
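
The effect of `stratify` is easier to see on a small example. This sketch uses hypothetical toy labels (80 ham, 20 spam) rather than the SMS data, and checks that both splits keep the same 4:1 ratio.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 80 ham and 20 spam placeholders (not real messages).
toy_y = ["ham"] * 80 + ["spam"] * 20
toy_X = [f"message {i}" for i in range(100)]

# Same call shape as above: default 25% test size, stratified on the labels.
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, random_state=42, stratify=toy_y
)

train_counts = Counter(toy_y_train)
test_counts = Counter(toy_y_test)
print(train_counts)
print(test_counts)
```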

### Pipelining
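We will create a *pipeline*<sup>[6](#fnt6)</sup> composed of *CountVectorizer*<sup>[7](#fnt7)</sup> and *BernoulliNB*.<sup>[8](#fnt8)</sup> Setting `binary=True` makes the vectorizer record only whether a word appears in a message, which matches the presence/absence features that *BernoulliNB* models.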

In [None]:
# create the pipeline
bernoulli_nb_pipeline = make_pipeline(CountVectorizer(binary=True), BernoulliNB(alpha=10**-6))

### Model Training
Training the model is as simple as calling the `fit` method on the *pipeline*.

In [None]:
# pass the train data to the fit method
bernoulli_nb_pipeline.fit(X_train, y_train);
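
Under the hood, fitting the pipeline fits the vectorizer, transforms the text, and then fits the classifier on the result. The sketch below checks this on four made-up messages by comparing the pipeline's predictions with the same steps done by hand. All `toy_*` names and messages are illustrative assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Made-up training messages and labels (not taken from the data set).
toy_train = [
    "win a free prize now",
    "free cash prize",
    "see you at lunch",
    "are you home yet",
]
toy_labels = ["spam", "spam", "ham", "ham"]
toy_new = ["free prize today", "lunch at home"]

# Pipeline version: one fit call, one predict call.
toy_pipeline = make_pipeline(CountVectorizer(binary=True), BernoulliNB(alpha=1e-6))
toy_pipeline.fit(toy_train, toy_labels)
pipeline_pred = list(toy_pipeline.predict(toy_new))

# Manual version: the same two steps spelled out.
toy_vectorizer = CountVectorizer(binary=True)
toy_classifier = BernoulliNB(alpha=1e-6)
toy_classifier.fit(toy_vectorizer.fit_transform(toy_train), toy_labels)
manual_pred = list(toy_classifier.predict(toy_vectorizer.transform(toy_new)))

print(pipeline_pred)
print(manual_pred)
```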

### Model Testing
Finally, we generate predictions for the `X_test` data using the *freshly trained* model. We then compare the resulting `y_test_pred` to the original `y_test` from the `train_test_split` utility using the `classification_report` function.<sup>[9](#fnt9)</sup>

In [None]:
# generate predicted spam/ham labels based on X_test data
y_test_pred = bernoulli_nb_pipeline.predict(X_test)

# compare the predicted y vs. test y
print(classification_report(y_test, y_test_pred))
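
The precision and recall columns in the report can be reproduced by hand. This sketch uses made-up labels (not the notebook's results) and treats `spam` as the positive class, then compares the hand-computed values with scikit-learn's `precision_score` and `recall_score`.

```python
from sklearn.metrics import precision_score, recall_score

# Made-up true and predicted labels, for illustration only.
toy_true = ["ham", "ham", "ham", "spam", "spam", "spam", "spam", "ham"]
toy_pred = ["ham", "ham", "spam", "spam", "spam", "ham", "spam", "ham"]

pairs = list(zip(toy_true, toy_pred))
tp = sum(t == "spam" and p == "spam" for t, p in pairs)  # spam caught
fp = sum(t == "ham" and p == "spam" for t, p in pairs)   # ham flagged as spam
fn = sum(t == "spam" and p == "ham" for t, p in pairs)   # spam missed

# precision: share of predicted spam that is really spam
# recall: share of real spam that was predicted as spam
manual_precision = tp / (tp + fp)
manual_recall = tp / (tp + fn)

sk_precision = precision_score(toy_true, toy_pred, pos_label="spam")
sk_recall = recall_score(toy_true, toy_pred, pos_label="spam")
print(manual_precision, manual_recall)
```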

#### Visualizing Model Performance
It is a little easier to understand the *performance* of the model if we plot a *confusion matrix*<sup>[10](#fnt10)</sup> using scikit-learn's *ConfusionMatrixDisplay* class.<sup>[11](#fnt11)</sup>

In [None]:
# get necessary visualization libs
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

# setup the confusion matrix plot
fig, ax = plt.subplots(figsize=(10, 5))
ConfusionMatrixDisplay.from_predictions(y_test, y_test_pred, ax=ax)
ax.xaxis.set_ticklabels(bernoulli_nb_pipeline.classes_)
ax.yaxis.set_ticklabels(bernoulli_nb_pipeline.classes_)
ax.set_title(
    f'Confusion Matrix for {type(bernoulli_nb_pipeline[-1]).__name__}'
    '\n'
    'on the SMS Spam Data Set'
);
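
The numbers behind such a plot can also be inspected directly with `confusion_matrix`. The sketch below uses made-up labels rather than the model's predictions. Rows correspond to true labels and columns to predicted labels, in the order given by `labels`.

```python
from sklearn.metrics import confusion_matrix

# Made-up true and predicted labels, for illustration only.
toy_true = ["ham", "ham", "ham", "spam", "spam", "spam", "spam", "ham"]
toy_pred = ["ham", "ham", "spam", "spam", "spam", "ham", "spam", "ham"]

# Fix the label order so that row/column 0 is ham and row/column 1 is spam.
toy_cm = confusion_matrix(toy_true, toy_pred, labels=["ham", "spam"])

# With spam as the positive class, ravel() yields tn, fp, fn, tp.
tn, fp, fn, tp = (int(v) for v in toy_cm.ravel())
print(toy_cm)
print(tn, fp, fn, tp)
```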

## References
<span id="fnt1">1: [Email Filtering Wikipedia](https://en.wikipedia.org/wiki/Email_filtering)</span>
<br>
<span id="fnt2">2: [Anti-spam Wikipedia](https://en.wikipedia.org/wiki/Anti-spam_techniques)</span>
<br>
<span id="fnt3">3: [Naive Bayes Spam Filtering Wikipedia](https://en.wikipedia.org/wiki/Naive_Bayes_spam_filtering)</span>
<br>
<span id="fnt4">4: [UCI SMS Spam Collection Data Set](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection)</span>
<br>
<span id="fnt5">5: [sklearn.model_selection.train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)</span>
<br>
<span id="fnt6">6: [sklearn.pipeline.Pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html)</span>
<br>
<span id="fnt7">7: [sklearn.feature_extraction.text.CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)</span>
<br>
<span id="fnt8">8: [sklearn.naive_bayes.BernoulliNB](https://scikit-learn.org/stable/modules/naive_bayes.html#bernoulli-naive-bayes)</span>
<br>
<span id="fnt9">9: [sklearn.metrics.classification_report](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html)</span>
<br>
<span id="fnt10">10: [Confusion Matrix Wikipedia](https://en.wikipedia.org/wiki/Confusion_matrix)</span>
<br>
<span id="fnt11">11: [sklearn.metrics.ConfusionMatrixDisplay](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.ConfusionMatrixDisplay.html)</span>
<br>
<span id="fnt12">12: [Build a NLP Pipeline with SciKit-Learn: Ham or Spam?](https://towardsdatascience.com/build-a-nlp-pipeline-with-scikit-learn-ham-or-spam-b2cd0b3bc0c1)</span>
<br>
<span id="fnt13">13: [edkrueger/spam-detector](https://github.com/edkrueger/spam-detector)</span>
<br>
<span id="fnt14">14: [Scikit-Learn : Spam Comment Filter Using SVM ](https://www.bogotobogo.com/python/scikit-learn/scikit_learn_Support_Vector_Machines_SVM_spam_filtermachine_learning_.php)</span>
<br>
<span id="fnt15">15: [How to Build a Spam Classifier in Python and Sklearn](https://www.milindsoorya.com/blog/build-a-spam-classifier-in-python)</span>
<br>
<span id="fnt16">16: [Spam detection using Scikit learn](https://www.kaggle.com/code/yakinrubaiat/spam-detection-using-scikit-learn/notebook)</span>
<br>
<span id="fnt17">17: [Spam Classifier using Naive Bayes](https://etav.github.io/projects/spam_message_classifier_naive_bayes.html)</span>
<br>