# Let's build a spam classifier

We will use data from `SMS Spam Collection v. 1` described as:

> a public set of SMS labeled messages that have been collected for mobile phone spam research. It has one collection composed by 5,574 English, real and non-encoded messages, tagged according being legitimate (ham) or spam.

([source](http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/))

#### Load useful libraries and data

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\email\AppData\Roaming\nltk_data...
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\email\AppData\Roaming\nltk_data...


True

In [2]:
# Load data
data = pd.read_csv(
    "data/SMSSpamCollection.txt",
    encoding="utf-8",
    header=None,
    delimiter="\t",
    names=["target", "text"],
)

# Encoding target variable
data["target"] = np.where(data["target"] == "spam", 1, 0)

In [3]:
# Looking at a sample of our data
data.sample(3)

| | target | text |
|---|---|---|
| 1629 | 0 | Yes princess! I want to make you happy... |
| 4931 | 1 | Hi, the SEXYCHAT girls are waiting for you to ... |
| 3145 | 0 | Haha I heard that, text me when you're around |


## A quick look at the data

In [4]:
print("Dataset contains {} instances of {} variables.".format(data.shape[0], data.shape[1]))

print(
    "It contains {} spam messages ({:.1%} of all).".format(
        data[data.target == 1].shape[0],
        data[data.target == 1].shape[0] / data.shape[0],
    )
)

Dataset contains 5572 instances of 2 variables.
It contains 747 spam messages (13.4% of all).


In [5]:
# Print random samples of text from both classes, i.e. spam and non-spam
print(
    "Examples of spam SMS: \n    {}\n    {}".format(
        data[data.target == 1].sample(1).text.iloc[0],
        data[data.target == 1].sample(1).text.iloc[0],
    )
)
print(
    "\nExamples of non-spam SMS: \n    {}\n    {}".format(
        data[data.target == 0].sample(1).text.iloc[0],
        data[data.target == 0].sample(1).text.iloc[0],
    )
)

Examples of spam SMS: 
    You have 1 new message. Please call 08718738034.
    Message Important information for O2 user. Today is your lucky day! 2 find out why log onto http://www.urawinner.com there is a fantastic surprise awaiting you

Examples of non-spam SMS: 
    No calls..messages..missed calls
    This is ur face test ( 1 2 3 4 5 6 7 8 9  &lt;#&gt;  ) select any number i will tell ur face astrology.... am waiting. quick reply...


## Spam classification

Here we build a "vanilla" classifier without thinking too hard about what the actual messages, spam or not, look like. To improve the model, you can of course investigate the data in more detail.

In [6]:
# Split dataset between train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data["text"], data["target"], random_state=0
)

In [7]:
X_train.head()

872     Its going good...no problem..but still need li...
831     U have a secret admirer. REVEAL who thinks U R...
1273                                                Ok...
3314    Huh... Hyde park not in mel ah, opps, got conf...
4929    Just hopeing that wasn‘t too pissed up to reme...
Name: text, dtype: object

### CountVectorizer

As you have seen, our X variable is plain text, i.e. one string per message. A classifier cannot work with raw strings, so we need to turn the text into numeric features. One way is to treat each word as a separate feature and count how many times it occurs in each SMS. We can do this with the scikit-learn [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html), which converts our text and returns a sparse matrix. The vocabulary of an English corpus is quite large, while a single SMS uses only a small subset of it, so most entries of the feature matrix are zero. Storing the matrix in sparse form therefore saves memory.



In [8]:
# Fit the CountVectorizer to the training data
vect = CountVectorizer().fit(X_train)

# transform the documents in the training data to a document-term matrix
X_train_vectorized = vect.transform(X_train)
# print("X_train_vectorized: ", X_train_vectorized)

In [9]:
print("X_train shape = {}".format(X_train.shape))
print("Vocabulary length = {}".format(len(vect.vocabulary_)))

X_train shape = (4179,)
Vocabulary length = 7546


So in 4179 messages we found 7546 different words.

In [10]:
# Let's look at our vocabulary list (sorted alphabetically)
# Does it look like you expected?
sorted(vect.vocabulary_.items(), key=lambda x: x[1])[:20]

[('00', 0),
 ('000', 1),
 ('000pes', 2),
 ('008704050406', 3),
 ('0089', 4),
 ('0121', 5),
 ('01223585334', 6),
 ('0125698789', 7),
 ('02', 8),
 ('0207', 9),
 ('02072069400', 10),
 ('02073162414', 11),
 ('021', 12),
 ('03', 13),
 ('04', 14),
 ('0430', 15),
 ('05', 16),
 ('050703', 17),
 ('0578', 18),
 ('06', 19)]

In [11]:
# We can also print the newly created feature matrix
# Note: it is a sparse matrix, so most of its values are 0.
# .toarray() converts the compressed sparse matrix into a regular (dense) numpy array
print(X_train_vectorized.toarray())

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]


We can train our first model with the vectorized data.

In [12]:
# Train the model
model = LogisticRegression(max_iter=1500)
model.fit(X_train_vectorized, y_train)

# Predict the transformed test documents
predictions = model.predict(vect.transform(X_test))
predict_probab = model.predict_proba(vect.transform(X_test))[:,1]

print("AUC = {:.3f}".format(roc_auc_score(y_test, predict_probab)))

AUC = 0.985


Let's see which words have the highest and lowest coefficients.

Think back to the sigmoid function (logistic function). 
What class are observations assigned to if they contain words with high coefficients?  And to which class are they assigned if they contain words with high negative coefficients?
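
As a reminder of how the logistic function turns coefficients into probabilities, here is a small sketch. The intercept and the two coefficients are hypothetical values chosen only for illustration; they are not taken from the fitted model.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real-valued score to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical intercept and coefficients, chosen only for illustration
intercept = -2.0
coef_positive_word = 3.0   # a word with a large positive coefficient
coef_negative_word = -3.0  # a word with a large negative coefficient

# Probability of class 1 (spam) when the word occurs once in the message
p_positive = sigmoid(intercept + coef_positive_word * 1)
p_negative = sigmoid(intercept + coef_negative_word * 1)

print("P(spam) with the positive-coefficient word: {:.3f}".format(p_positive))
print("P(spam) with the negative-coefficient word: {:.3f}".format(p_negative))
```

A positive coefficient pushes the score up and the predicted probability towards class 1 (spam); a negative coefficient pushes it towards class 0 (ham).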

In [13]:
# get the feature names as numpy array
feature_names = np.array(vect.get_feature_names_out())

# Sort the coefficients from the model (from lowest to highest values)
sorted_coef_index = model.coef_[0].argsort()

# Find the 10 smallest and 10 largest coefficients
# The 10 largest coefficients are being indexed using [:-11:-1]
# so the list returned is in order of largest to smallest
print("Smallest Coefs:\n{}\n".format(feature_names[sorted_coef_index[:10]]))
print("Largest Coefs: \n{}".format(feature_names[sorted_coef_index[:-11:-1]]))

Smallest Coefs:
['me' 'my' 'gt' 'lt' 'll' 'but' 'am' 'he' 'amp' 'right']

Largest Coefs: 
['txt' 'uk' 'ringtone' 'text' 'call' 'chat' 'reply' 'new' 'won' 'stop']


The AUC of our first model was already very good (0.985). Let's see if we can improve on it with another transformation of our data. Next, we will test the TF-IDF transformation.




### TF-IDF

TF-IDF is short for **Term Frequency - Inverse Document Frequency**. 

It measures how important a word is to a document within a collection of texts (in our case, all the SMS we collected). A word that is frequent in a document and also frequent across the corpus is less important to that document than a word that is frequent in the document but rare in the corpus.




In [14]:
# Fit the TfidfVectorizer to the training data, specifying a minimum document frequency of 15
# This means a word must occur in at least 15 SMS to be kept
vect = TfidfVectorizer(min_df=15).fit(X_train)

# transform the documents in the training data to a document-term matrix
X_train_vectorized = vect.transform(X_train)

# let's look at some of the words gathered with this method
sorted(vect.vocabulary_.items(), key=lambda x: x[1])[10:30]

[('2000', 10),
 ('250', 11),
 ('2nd', 12),
 ('50', 13),
 ('500', 14),
 ('5000', 15),
 ('800', 16),
 ('able', 17),
 ('about', 18),
 ('abt', 19),
 ('account', 20),
 ('actually', 21),
 ('address', 22),
 ('after', 23),
 ('afternoon', 24),
 ('again', 25),
 ('ah', 26),
 ('aight', 27),
 ('all', 28),
 ('alone', 29)]

In [15]:
# How many words appear in at least 15 text messages?
len(vect.vocabulary_)

577

We can check which words reach the largest tf-idf values across the texts.

In [16]:
# save all feature names == words in an array
feature_names = np.array(vect.get_feature_names_out())

# sort the column indices by the highest tf-idf value found in each column
sorted_tfidf_index = X_train_vectorized.toarray().max(0).argsort()

# print the words with the smallest and the largest maximum tf-idf values
print("Smallest tfidf:\n{}\n".format(feature_names[sorted_tfidf_index[:10]]))
print("Largest tfidf: \n{}".format(feature_names[sorted_tfidf_index[:-11:-1]]))

Smallest tfidf:
['2000' 'weekly' 'rate' '16' 'sae' 'await' 'vouchers' 'guaranteed' '1000'
 'collection']

Largest tfidf: 
['yup' 'with' 'sure' 'babe' 'heart' 'he' 'thank' 'why' 'happy' 'thanx']


Let's use our new features with logistic regression.

In [17]:
# Train the model
model = LogisticRegression(max_iter=1500)
model.fit(X_train_vectorized, y_train)

# Predict the transformed test documents
predictions = model.predict_proba(vect.transform(X_test))[:,1]

print("AUC = {:.3f}".format(roc_auc_score(y_test, predictions)))

AUC = 0.988


With only 577 features out of 7546 (7546 different words were in our training texts), we still get a high value for the AUC score.
Feel free to test different values for the minimum document frequency for the tf-idf vectorizer and see how this affects the model.
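
A minimal sketch of such an experiment is shown below. To stay self-contained it uses a made-up corpus and only reports the vocabulary size; in the notebook you would fit on `X_train` instead, retrain the model for each value, and compare the AUC scores.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus; in the notebook you would pass X_train instead
toy_corpus = [
    "call now to claim your free prize",
    "free entry call now",
    "are you free later",
    "call me later",
    "see you later",
]

vocab_sizes = {}
for min_df in (1, 2, 3):
    toy_vect = TfidfVectorizer(min_df=min_df).fit(toy_corpus)
    vocab_sizes[min_df] = len(toy_vect.vocabulary_)
    print("min_df={}: {} words kept -> {}".format(
        min_df, vocab_sizes[min_df], sorted(toy_vect.vocabulary_)))
```

Raising `min_df` shrinks the vocabulary quickly, because most words occur in only a few messages.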

Again, we can look at the coefficients of our new model.


In [18]:
# Sort the coefficients from the model
sorted_coef_index = model.coef_[0].argsort()

# Find the 10 smallest and 10 largest coefficients
# The 10 largest coefficients are being indexed using [:-11:-1]
# so the list returned is in order of largest to smallest
print("Smallest Coefs:\n{}\n".format(feature_names[sorted_coef_index[:10]]))
print("Largest Coefs: \n{}".format(feature_names[sorted_coef_index[:-11:-1]]))

Smallest Coefs:
['me' 'my' 'gt' 'lt' 'll' 'that' 'ok' 'later' 'da' 'how']

Largest Coefs: 
['txt' 'call' 'text' 'free' 'stop' 'uk' 'claim' 'www' 'reply' '150p']


You can see that there is a lot of overlap in the features that received the highest and lowest coefficients compared to the previous model; regardless of how we convert our text into features, these words seem to be important for classifying spam with logistic regression.
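
To quantify the overlap, the sketch below intersects the word lists printed by the two models above. The lists are copied from those outputs; the variable names are illustrative.

```python
# Top words copied from the printed outputs of the two models above
count_largest = ['txt', 'uk', 'ringtone', 'text', 'call', 'chat', 'reply', 'new', 'won', 'stop']
tfidf_largest = ['txt', 'call', 'text', 'free', 'stop', 'uk', 'claim', 'www', 'reply', '150p']
count_smallest = ['me', 'my', 'gt', 'lt', 'll', 'but', 'am', 'he', 'amp', 'right']
tfidf_smallest = ['me', 'my', 'gt', 'lt', 'll', 'that', 'ok', 'later', 'da', 'how']

# Words that appear in the top 10 of both models
shared_largest = sorted(set(count_largest) & set(tfidf_largest))
shared_smallest = sorted(set(count_smallest) & set(tfidf_smallest))

print("Shared spam indicators ({} of 10): {}".format(len(shared_largest), shared_largest))
print("Shared ham indicators ({} of 10): {}".format(len(shared_smallest), shared_smallest))
```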


#### Text data can be preprocessed further before it is used as features in a model. We will first try stemming:

### Stemming

Stemming reduces a word to its stem. The result is less readable to humans, but makes the text more comparable across observations.

For example, the words "consult", "consultant", "consulting", "consultative" and "consultants" all have the same stem, **"consult"**.

We will now add stemming as a preprocessing step to our workflow. The nltk `PorterStemmer` generates the stem of each word. These stems are then used as the features in the CountVectorizer, which builds a matrix of stem counts.
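
Before wiring the stemmer into the vectorizer, here is a quick standalone check of the example above. The name `stemmer_demo` is illustrative; `PorterStemmer` needs no extra nltk downloads.

```python
from nltk.stem import PorterStemmer

stemmer_demo = PorterStemmer()

# The word family from the example above
words = ["consult", "consultant", "consulting", "consultative", "consultants"]
stems = [stemmer_demo.stem(w) for w in words]
for word, stem in zip(words, stems):
    print("{:>14} -> {}".format(word, stem))

# Stems are not always dictionary words
other_stems = {w: stemmer_demo.stem(w) for w in ["little", "experience", "voice"]}
print(other_stems)
```

The second group shows why stemmed text is harder to read: the stemmer strips suffixes by rule and does not check that the result is a real word.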

In [19]:
# Initializing stemmer and countvectorizer 
stemmer = nltk.PorterStemmer()
cv_analyzer = CountVectorizer().build_analyzer()
# tfidf_analyzer = TfidfVectorizer(min_df=15).build_analyzer()

def stemmed_words(doc):
    return (stemmer.stem(w) for w in cv_analyzer(doc))

# define CountVectorizer with stemming function 
stem_vectorizer = CountVectorizer(analyzer = stemmed_words)
# stem_vectorizer = TfidfVectorizer(min_df=15, analyzer = stemmed_words)


# Transform X_train
X_train_stem_vectorized = stem_vectorizer.fit_transform(X_train)

Let's break the cell above into steps and look at what each one does and why:

- `CountVectorizer().build_analyzer()` returns a function that handles the preprocessing, tokenization and n-gram generation for a text
- In the function `stemmed_words()`
  - The text is first passed through that analyzer, and each resulting token is then reduced to its stem
- None of this runs until the last step, when we call `fit_transform()` on the dataset; the vectorizer then applies `stemmed_words()` to every message

- The cell below shows how the analyzer preprocesses and tokenizes a sample text
- Its last line shows how the stemmer reduces each token to its stem

In [20]:
sample_text = X_train.iloc[0]  # first message in the training set
print("Sample Text - ", sample_text)
print("-"*30)
print("Text after passing through build_analyzer - ", cv_analyzer(sample_text))
print("-"*30)
print("Text after stemming - ", [stemmer.stem(w) for w in cv_analyzer(sample_text)])


Sample Text -  Its going good...no problem..but still need little experience to understand american customer voice...
------------------------------
Text after passing through build_analyzer -  ['its', 'going', 'good', 'no', 'problem', 'but', 'still', 'need', 'little', 'experience', 'to', 'understand', 'american', 'customer', 'voice']
------------------------------
Text after stemming -  ['it', 'go', 'good', 'no', 'problem', 'but', 'still', 'need', 'littl', 'experi', 'to', 'understand', 'american', 'custom', 'voic']


You can also uncomment the tf-idf lines in the stemming cell above to use `TfidfVectorizer` instead of `CountVectorizer` (remember to comment out the corresponding `CountVectorizer` lines).

In [21]:
# Train the model with stemmed and vectorized dataset
model_stemm = LogisticRegression(max_iter=1500)
model_stemm.fit(X_train_stem_vectorized, y_train)

# Predict the transformed test documents
predictions = model_stemm.predict_proba(stem_vectorizer.transform(X_test))[:,1]

print("AUC = {:.3f}".format(roc_auc_score(y_test, predictions)))

AUC = 0.984


In [22]:
# get the feature names as numpy array
feature_names = np.array(stem_vectorizer.get_feature_names_out())

# Sort the coefficients from the model
sorted_coef_index = model_stemm.coef_[0].argsort()

# Find the 10 smallest and 10 largest coefficients
# The 10 largest coefficients are being indexed using [:-11:-1]
# so the list returned is in order of largest to smallest
print("Smallest Coefs:\n{}\n".format(feature_names[sorted_coef_index[:10]]))
print("Largest Coefs: \n{}".format(feature_names[sorted_coef_index[:-11:-1]]))

Smallest Coefs:
['me' 'my' 'gt' 'lt' 'll' 'but' 'am' 'hope' 'he' 'that']

Largest Coefs: 
['txt' 'uk' 'rington' 'text' 'chat' 'new' 'repli' 'won' 'call' 'cost']


Some different words now appear among the features with the largest absolute coefficients.
The AUC score (0.984) is slightly below the scores of our two previous text representations (0.985 and 0.988).

### Lemmatization

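Just as with stemming, we can apply lemmatization to our data.
Lemmatization reduces variant forms of a word to its base form, the lemma (e.g. am, are, is --> be; car, cars, car's, cars' --> car). Note that `WordNetLemmatizer.lemmatize()` treats every token as a noun unless a part-of-speech tag is passed, so the call below, which passes none, will not reduce verb forms such as "is" to "be".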


In [None]:
# Initialization
WNlemma = nltk.WordNetLemmatizer()
cv_analyzer = CountVectorizer().build_analyzer()
# cv_analyzer = TfidfVectorizer(min_df=15).build_analyzer()

def lemmatize_word(doc):
    return (WNlemma.lemmatize(t) for t in cv_analyzer(doc))

lemm_vectorizer = CountVectorizer(analyzer = lemmatize_word)
# lemm_vectorizer = TfidfVectorizer(min_df=15, analyzer=lemmatize_word)

# Transform X_train
X_train_lemm_vectorized = lemm_vectorizer.fit_transform(X_train)

In [None]:
X_train_lemm_vectorized.shape

With lemmatization we were able to reduce the features from ca. 7500 to 7100. 

In [None]:
from sklearn.metrics import f1_score

In [None]:
# Train the model with lemmatized and vectorized dataset
model_lemm = LogisticRegression(max_iter=1500)
model_lemm.fit(X_train_lemm_vectorized, y_train)

# Predict the transformed test documents
predict_probab = model_lemm.predict_proba(lemm_vectorizer.transform(X_test))[:,1]
predictions = model_lemm.predict(lemm_vectorizer.transform(X_test))

print("F1 = {:.3f}".format(f1_score(y_test, predictions)))
print("AUC = {:.3f}".format(roc_auc_score(y_test, predict_probab)))



In [None]:
# get the feature names as numpy array
feature_names = np.array(lemm_vectorizer.get_feature_names_out())


# Sort the coefficients from the model
sorted_coef_index = model_lemm.coef_[0].argsort()

# Find the 10 smallest and 10 largest coefficients
# The 10 largest coefficients are being indexed using [:-11:-1]
# so the list returned is in order of largest to smallest
print("Smallest Coefs:\n{}\n".format(feature_names[sorted_coef_index[:10]]))
print("Largest Coefs: \n{}".format(feature_names[sorted_coef_index[:-11:-1]]))

The results for this last model are very similar to those of the first model we tested.

You can test how well lemmatization works in combination with tf-idf on our example data. Just remove the `#` at the beginning of the tf-idf lines in the lemmatization cell, and add a `#` to the corresponding `CountVectorizer` lines.

## Testing our spam classifier

Let's test our classifier. You can also input your own text.

In [None]:
# Your input below
input_text = "We went yesterday to the beach, call me first. But then also call the other guy pls. Use this number and then the other number"

# Or use an example from the test set
#input_text = X_test.sample(1).iloc[0]
input_text

In [None]:
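# `model` and `vect` currently hold the tf-idf model and vectorizer fitted above.
# To try model_stemm or model_lemm instead, also swap in the matching
# vectorizer (stem_vectorizer or lemm_vectorizer).
if model.predict(vect.transform([input_text]))[0] == 1:
    print('This is spam!')
else:
    print('Not spam :)')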
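
Because each model only works with the vectorizer it was trained with, one way to avoid mismatches is to bundle both in a scikit-learn pipeline. This is a sketch on made-up data, not the workflow used in this notebook; `spam_pipeline` and the toy messages are illustrative assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data (1 = spam, 0 = ham); in the notebook use X_train, y_train
toy_texts = [
    "free prize call now", "claim your free ringtone", "txt stop to win cash",
    "see you later", "are you coming home", "call me when you are around",
]
toy_labels = [1, 1, 1, 0, 0, 0]

# The pipeline vectorizes and classifies in one object,
# so raw strings can be passed straight to fit() and predict()
spam_pipeline = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1500))
spam_pipeline.fit(toy_texts, toy_labels)

new_message = "free cash prize"
label = spam_pipeline.predict([new_message])[0]
print("This is spam!" if label == 1 else "Not spam :)")
```

With this setup, swapping the text representation means swapping a single pipeline object rather than keeping a model and a vectorizer in sync by hand.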

In [None]:
from sklearn.metrics import hamming_loss
# `predictions` holds the test-set predictions of the last model fitted above (model_lemm)
print(f'The hamming loss is {hamming_loss(y_test,predictions):.4f}')
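
For a single-label binary problem like ours, the Hamming loss is the fraction of misclassified messages, i.e. one minus the accuracy. The sketch below checks this on made-up labels (the `*_demo` values are illustrative, not model output).

```python
from sklearn.metrics import accuracy_score, hamming_loss

# Hypothetical true labels and predictions (1 = spam, 0 = ham)
y_true_demo = [0, 0, 1, 1, 0, 1, 0, 0]
y_pred_demo = [0, 0, 1, 0, 0, 1, 1, 0]

loss = hamming_loss(y_true_demo, y_pred_demo)
accuracy = accuracy_score(y_true_demo, y_pred_demo)

# Two of the eight predictions are wrong
print("Hamming loss = {:.2f}".format(loss))
print("Accuracy     = {:.2f}".format(accuracy))
```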

Feel free to test your own SMS messages and see which words you can add to change the prediction of a ham message to a spam message.