# SMS Spam Detection

The goal of this notebook is to explore a dataset of SMS messages, each tagged as either `spam` or `ham` (legitimate). We will then build a predictive model to detect whether a given text is spam.

The general layout of this notebook is as follows:

1. Imports and loading data
2. Exploring the text
3. Pre-processing
4. Turning the raw text into feature vectors (using BERT)
5. Splitting the data into training/testing sets
6. Utilizing a simple logistic regression model with the feature vectors


## Dataset

We use the SMS Spam Collection dataset, available at https://www.kaggle.com/uciml/sms-spam-collection-dataset.

## Imports

In [10]:
import spacy
import statistics
import re
import pandas as pd
import numpy as np
import torch

from sklearn.model_selection import train_test_split

In [11]:
# Download the SpaCy "small" model if it is not already downloaded
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.0.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.0.0/en_core_web_sm-3.0.0-py3-none-any.whl (13.7 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [12]:
# Load in a trained SpaCy pipeline to pre-process text
nlp = spacy.load("en_core_web_sm")

## Data Exploration

Here we are going to dig through the text and try to understand the shape of the text we're working with.

In [13]:
data = pd.read_csv("input/spam.csv")[["v1", "v2"]]
data.columns = ["label", "text"]
data.head()

|   | label | text |
|---|-------|------|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


Let's see the distribution of labels within the dataset.

In [14]:
data["label"].value_counts()

ham     4825
spam     747
Name: label, dtype: int64

So in total we have:

* 5,572 texts
* 4,825 (~87%) of texts are legitimate (`ham`)
* 747 (~13%) of texts are `spam`
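
As a quick check on the shares listed above, the percentages can be recomputed from the raw counts. This is a minimal sketch that hard-codes the counts from the `value_counts()` output rather than reading the CSV; the variable names are illustrative.

```python
# Label counts copied from the value_counts() output above
counts = {"ham": 4825, "spam": 747}
total = sum(counts.values())

# Share of each label as a percentage of all messages
shares = {label: 100 * n / total for label, n in counts.items()}

for label, pct in shares.items():
    print(f"{label}: {counts[label]} messages ({pct:.1f}%)")
```

On the full DataFrame, `data["label"].value_counts(normalize=True)` should give the same proportions directly.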

In [15]:
# Do we have any empty texts with no content in the dataset?
data[data["text"] == ""].count()

label    0
text     0
dtype: int64

In [16]:
data[data["text"].isnull()].count()

label    0
text     0
dtype: int64

So there are no empty or missing values. Are there any unusual characters in the texts?

In [17]:
data["text"].apply(lambda x: re.findall(r'[^\w\s,.\(\)&!?*-]', x))

0           []
1           []
2       [', ']
3           []
4          [']
         ...  
5567       [�]
5568       [�]
5569        []
5570       [']
5571        []
Name: text, Length: 5572, dtype: object

Looking through the texts, it seems that all emojis have already been replaced with the � character.

Next, let's check the average length of each type of message.
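
To see which unusual characters dominate, rather than eyeballing a per-row list, the matches can be tallied across all texts. The sketch below reuses the same character class as the cell above on a few made-up sample strings; `sample_texts` is a hypothetical stand-in for `data["text"]`.

```python
import re
from collections import Counter

# Same character class as the notebook cell above: anything that is not a
# word character, whitespace, or one of the listed punctuation marks
UNUSUAL = re.compile(r"[^\w\s,.\(\)&!?*-]")

# Hypothetical sample texts standing in for data["text"]
sample_texts = [
    "Free entry in 2 a wkly comp!",
    "Nah I don't think he goes to usf",
    "Call now \ufffd\ufffd claim your prize",
]

# Count every unusual character across all texts
char_counts = Counter(ch for text in sample_texts for ch in UNUSUAL.findall(text))
print(char_counts.most_common())
```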

In [18]:
ham_data = data[data["label"] == "ham"]
spam_data = data[data["label"] == "spam"]

In [19]:
# Let's compare the average length of ham vs. spam messages.
average_ham_length = statistics.mean(ham_data["text"].apply(lambda x: len(x)))
average_spam_length = statistics.mean(spam_data["text"].apply(lambda x: len(x)))

print(f"Average length of Ham texts is: {average_ham_length}")
print(f"Average length of Spam texts is: {average_spam_length}")


Average length of Ham texts is: 71.02134715025906
Average length of Spam texts is: 138.429718875502


So spam messages do appear to be significantly longer on average. The spammer is trying to elicit some response from the user rather than having an actual conversation, so this makes sense.
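
The same comparison can be written more compactly with a pandas group-by. This is a sketch on a tiny made-up frame (`toy` is hypothetical, standing in for `data`), so the numbers it prints are not the dataset's.

```python
import pandas as pd

# Hypothetical toy frame standing in for the notebook's `data`
toy = pd.DataFrame({
    "label": ["ham", "ham", "spam", "spam"],
    "text": ["ok", "see you", "WIN CASH NOW", "FREE entry!!"],
})

# Character length per message, averaged within each label
mean_length = toy["text"].str.len().groupby(toy["label"]).mean()
print(mean_length)
```

Swapping `.mean()` for `.median()` or `.describe()` would show whether a few very long messages are driving the average.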

## Data Preparation

Here we define some processing functions to parse the raw text and prepare it for further analysis.

In [20]:
def process_text(text: str) -> str:
    # Lowercase the text
    text = text.lower()
    
    # Replace each run of digits with the symbol "#"
    text = re.sub(r"\d+", "#", text)

    # Replace long sequences of whitespace with a single space
    text = re.sub(r"\s+", " ", text)
        
    # Using SpaCy, remove stopwords and lemmatize each word
    doc = nlp(text)
    text = ' '.join([token.lemma_ for token in doc if not token.is_stop])
    
    # Now return the processed text
    return text

def process_df(df: pd.DataFrame) -> pd.DataFrame:
    """Pre-process a dataframe by adding in some descriptive columns, and pre-processing the text."""
    new_df = df.copy()
    new_df["text"] = new_df["text"].apply(process_text)
    new_df["text_length"] = new_df["text"].str.len()
    
    return new_df
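
To make the regex steps concrete, here is a small sketch of just the lowercasing, digit replacement, and whitespace collapsing, without the SpaCy stop-word and lemmatization step. `normalize` is a hypothetical helper, not part of the notebook, and the input string is made up.

```python
import re


def normalize(text: str) -> str:
    """Hypothetical helper: the regex-only steps of process_text, without SpaCy."""
    text = text.lower()
    text = re.sub(r"\d+", "#", text)  # each run of digits becomes a single "#"
    text = re.sub(r"\s+", " ", text)  # collapse runs of whitespace
    return text


example = normalize("WINNER!!  Call 09061701461   to claim 1000 cash")
print(example)  # winner!! call # to claim # cash
```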

In [7]:
processed_df = process_df(data)

In [8]:
processed_df.head()

|   | label | text | text_length |
|---|-------|------|-------------|
| 0 | ham | jurong point , crazy .. available bugis n grea... | 92 |
| 1 | ham | ok lar ... joke wif u oni ... | 29 |
| 2 | spam | free entry # wkly comp win fa cup final tkts #... | 113 |
| 3 | ham | u dun early hor ... u c ... | 27 |
| 4 | ham | nah think go usf , live | 23 |


## Text Representation

When feeding text into some statistical model for prediction or for extracting insights, we must transform the text into some representation that the model will be able to understand. This is typically a vector of numbers used to describe a specific token, sentence, or document.

These vectors can be built from hand-picked features, or they can be generated by pre-trained models such as word2vec or, more recently, BERT. A model like BERT takes a sentence as input and outputs a numeric vector for each token that represents that token in the context of the sentence. These representations have achieved state-of-the-art results on many NLP tasks, and generally prove superior to hand-picked and curated features.
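
For contrast with learned representations, a hand-picked feature vector might look like the sketch below. The particular features (length, digit count, uppercase count, exclamation marks) are illustrative assumptions and are not used anywhere else in this notebook.

```python
from typing import List


def handcrafted_features(text: str) -> List[float]:
    """Hypothetical hand-picked features; the choice of features is illustrative only."""
    n_chars = len(text)
    n_digits = sum(ch.isdigit() for ch in text)
    n_upper = sum(ch.isupper() for ch in text)
    n_exclaim = text.count("!")
    return [float(n_chars), float(n_digits), float(n_upper), float(n_exclaim)]


vector = handcrafted_features("WIN 100 now!")
print(vector)  # [12.0, 3.0, 3.0, 1.0]
```

Every dimension here has a readable meaning, which is the interpretability that a learned embedding gives up.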

## BERT

This notebook uses a pre-trained DistilBERT model (a smaller, distilled variant of BERT) to generate vector representations of each text.

In [9]:
from transformers import DistilBertModel, DistilBertTokenizer

BERT_MODEL_TYPE = "distilbert-base-uncased"

# Load in the pre-trained tokenizer and model
tokenizer = DistilBertTokenizer.from_pretrained(BERT_MODEL_TYPE)
model = DistilBertModel.from_pretrained(BERT_MODEL_TYPE)

In [10]:
# Tokenize the pre-processed text using BERT's "word-piece" tokenizer
# This turns text into a list of IDs, where each ID maps a token to a particular embedding in BERT's
# pre-trained embeddings table.
processed_df["encoded_text"] = processed_df["text"].apply(lambda x: tokenizer.encode(x, add_special_tokens=True))

In [11]:
processed_df["encoded_text"]

0       [101, 18414, 17583, 2391, 1010, 4689, 1012, 10...
1       [101, 7929, 2474, 2099, 1012, 1012, 1012, 8257...
2       [101, 2489, 4443, 1001, 1059, 2243, 2135, 4012...
3       [101, 1057, 24654, 2220, 7570, 2099, 1012, 101...
4       [101, 20976, 2228, 2175, 2149, 2546, 1010, 244...
                              ...                        
5567    [101, 1001, 1050, 2094, 2051, 3046, 1001, 3967...
5568    [101, 1035, 1038, 2175, 9686, 24759, 5162, 320...
5569    [101, 12063, 1010, 1008, 6888, 1012, 1012, 101...
5570    [101, 3124, 7743, 2075, 2552, 2066, 4699, 4965...
5571                 [101, 20996, 10258, 1012, 2995, 102]
Name: encoded_text, Length: 5572, dtype: object

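When working with BERT, all input sequences in a batch need to be the same length. A simple approach is to find the longest text in our dataset and pad every other sequence to match. Below we use the character length of the processed text as a rough proxy for the number of tokens.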

In [12]:
max_sequence_length = max(processed_df["text_length"])
max_sequence_length

579

BERT can process at most 512 tokens at a time. The longest processed text (579 characters) exceeds that figure, although a character count only approximates a token count. To be safe, we assume that the first 512 tokens of each text carry enough signal: longer sequences are truncated to 512 tokens and shorter ones are padded to 512.

In [13]:
# Truncate or pad each sequence to 512 tokens (using 0 as the padding token)
PADDING_TOKEN = 0
MAX_LENGTH = 512

processed_df["encoded_text"] = processed_df["encoded_text"].apply(
    lambda x: x[:MAX_LENGTH] + [PADDING_TOKEN] * (MAX_LENGTH - len(x))
)
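
The truncate-then-pad step can be checked on short toy sequences. `pad_or_truncate` below is a hypothetical helper that mirrors the lambda above with a small maximum length so the result is easy to read.

```python
from typing import List


def pad_or_truncate(ids: List[int], max_length: int, pad_id: int = 0) -> List[int]:
    """Hypothetical helper: cut ids to max_length, then right-pad with pad_id."""
    ids = ids[:max_length]
    return ids + [pad_id] * (max_length - len(ids))


short = pad_or_truncate([101, 7592, 102], max_length=5)
long = pad_or_truncate([101, 1, 2, 3, 4, 5, 102], max_length=5)
print(short)  # [101, 7592, 102, 0, 0]
print(long)   # [101, 1, 2, 3, 4]
```

Note that plain slicing drops the trailing end-of-sequence ID (102 in the encoded output above) from any sequence that gets truncated.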

Now the magic: we pass these padded sequences through the pre-trained model and extract the last hidden layer. This layer provides a feature vector for every token, and the vector at the first position (the special `[CLS]` token) is commonly used as a feature vector for the whole sentence. Since spam detection is a sentence classification task, we keep only this sentence-level vector and discard the remaining token-level vectors.
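
The slicing involved is easier to see on a toy array. The sketch below uses a made-up NumPy array in place of the model's last hidden state, with hypothetical small dimensions; the real output in this notebook has 512 positions and 768 hidden units.

```python
import numpy as np

# Toy stand-in for a model's last hidden state: (batch, seq_len, hidden)
batch_size, seq_len, hidden_size = 2, 5, 4  # hypothetical sizes
last_hidden = np.arange(batch_size * seq_len * hidden_size, dtype=float).reshape(
    batch_size, seq_len, hidden_size
)

# Keep only position 0 of every sequence: one vector per sentence
sentence_vectors = last_hidden[:, 0, :]
print(sentence_vectors.shape)  # (2, 4)
```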

In [14]:
encoded_text = np.array(processed_df["encoded_text"].tolist())

# Construct a "mask" telling BERT to ignore the padding we have added
attention_mask = np.where(encoded_text != PADDING_TOKEN, 1, 0)

print(encoded_text.shape)
print(attention_mask.shape)

(5572, 512)
(5572, 512)
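
On a small example the mask is easy to inspect by eye. This sketch uses two made-up padded sequences; summing each mask row recovers the number of real (non-padding) tokens.

```python
import numpy as np

PADDING_TOKEN = 0

# Two hypothetical sequences already padded to length 6
padded = np.array([
    [101, 7592, 2088, 102, 0, 0],
    [101, 7929, 102, 0, 0, 0],
])

# 1 where there is a real token, 0 where there is padding
mask = (padded != PADDING_TOKEN).astype(int)
real_token_counts = mask.sum(axis=1)

print(mask)
print(real_token_counts)  # [4 3]
```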


In [34]:
%%time
input_ids = torch.tensor(encoded_text)
attention_mask = torch.tensor(attention_mask)

# Due to resource limitations (this runs on a CPU rather than a dedicated GPU machine),
# we feed the input to the model in batches. NOTE: this can take up to 20 minutes
# on a 12-core machine with 16 GB of RAM.
BATCH_SIZE = 100
batch_features = []

# Inference only, so gradient tracking is disabled
with torch.no_grad():
    for i in range(0, len(input_ids), BATCH_SIZE):
        outputs = model(input_ids[i:i + BATCH_SIZE], attention_mask=attention_mask[i:i + BATCH_SIZE])
        # outputs[0] is the last hidden state; position 0 of each sequence holds
        # the sentence-level feature vector
        batch_features.append(outputs[0][:, 0, :].numpy())

# Stack the per-batch arrays into a single (n_texts, hidden_size) array
features = np.concatenate(batch_features, axis=0)



CPU times: user 1h 43min 2s, sys: 14min 39s, total: 1h 57min 42s
Wall time: 20min 35s


In [37]:
features.shape

(5572, 768)

Because it took 20 minutes to generate these feature vectors, let's go ahead and write them out to disk so we can iterate faster.

In [39]:
np.save('bert-features.npy', features)
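
In a later session the saved array can be read back with `np.load` instead of re-running the model. The round trip is sketched below on a small random array written to a temporary directory; the file name `toy-features.npy` is a placeholder.

```python
import os
import tempfile

import numpy as np

# Small random array standing in for the real feature matrix
toy_features = np.random.default_rng(0).normal(size=(3, 4))

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy-features.npy")  # placeholder file name
    np.save(path, toy_features)
    reloaded = np.load(path)

print(np.array_equal(toy_features, reloaded))  # True
```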

## Data Splitting

Here we split our data into training and test sets. Normally you would also hold out a validation set for tuning the model's hyperparameters. Due to time constraints we use a simple 80/20 train/test split, stratified so that each split has the same proportion of `ham` and `spam` messages.

In [49]:
train_data, test_data, train_labels, test_labels = train_test_split(features, processed_df["label"], test_size=0.2, random_state=42, stratify=processed_df["label"])

In [50]:
train_data.shape

(4457, 768)

In [51]:
test_data.shape

(1115, 768)

In [53]:
train_labels.value_counts()

ham     3859
spam     598
Name: label, dtype: int64

In [54]:
test_labels.value_counts()

ham     966
spam    149
Name: label, dtype: int64
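
For reference, a three-way split with a validation set could be built by calling `train_test_split` twice. This is a sketch on synthetic data, not something this notebook does; the 60/20/20 ratio and all variable names are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic features and imbalanced labels standing in for the real data
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))
y = np.array(["spam"] * 130 + ["ham"] * 870)

# First carve off 20% as the test set
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
# Then take 25% of the remainder (20% of the total) as the validation set
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=42, stratify=y_trainval
)

for name, labels in [("train", y_train), ("val", y_val), ("test", y_test)]:
    print(name, len(labels), round(float(np.mean(labels == "spam")), 3))
```

Stratifying both calls keeps the spam share roughly constant across all three splits.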

## Logistic Regression

We feed the feature vectors obtained from BERT into a tried-and-true logistic regression model to see what kind of results this learned input representation gives.

In [55]:
from sklearn.linear_model import LogisticRegression

logreg_clf = LogisticRegression()
logreg_clf.fit(train_data, train_labels)

# Now that we've fit our model, evaluate on the test data
logreg_clf.score(test_data, test_labels)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


0.9901345291479821
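
The warning above suggests scaling the data or raising `max_iter`. One way to do both is a scikit-learn pipeline, sketched here on synthetic data from `make_classification`. Whether this changes the score on the BERT features was not tested in this notebook, and `max_iter=1000` is an arbitrary choice.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data standing in for the BERT features
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Standardize features, then fit logistic regression with a higher iteration cap
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)

accuracy = clf.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.3f}")
```

Because the scaler sits inside the pipeline, it is fitted on the training data only and then applied unchanged to the test data.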

# Summary

This was a simple study on spam detection, but the key player here was the informative feature vectors we were able to extract automatically using a pre-trained deep learning model (BERT). While we lose the interpretability of the features, we gain an empirically powerful representation of our texts based on a robust deep-learning architecture trained on a large corpus of English text.

## Results

We achieve an accuracy of 99%, meaning we correctly classified 99% of the texts in the test set as either `ham` or `spam`. With most machine-learning problems, accuracy this high would suggest a coding mistake or extreme over-fitting, but for simple spam detection it is not unusual. Most of the Kaggle notebooks for this problem appear to reach 97-99% accuracy on this dataset. This could be because spam messages rarely mimic real conversations, which makes them easy to separate out.
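
Because the classes are imbalanced, it helps to put the accuracy next to a majority-class baseline. The sketch below uses only the test-set label counts and the score reported above; the variable names are illustrative.

```python
# Test-set label counts and accuracy copied from the outputs above
test_counts = {"ham": 966, "spam": 149}
reported_accuracy = 0.9901345291479821

n_test = sum(test_counts.values())

# Accuracy of a model that always predicts the majority class ("ham")
baseline_accuracy = max(test_counts.values()) / n_test

# Number of test messages the classifier got wrong
n_correct = round(reported_accuracy * n_test)
n_errors = n_test - n_correct

print(f"Majority-class baseline: {baseline_accuracy:.3f}")
print(f"Misclassified test messages: {n_errors} of {n_test}")
```

So the model misclassifies 11 of the 1,115 test messages, against a baseline of roughly 87% for always predicting `ham`. Accuracy alone does not say how those errors divide between missed spam and flagged legitimate messages; a confusion matrix would.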