# Toward improving the spam classifier

The notebook `part_1_spam_classifier` provided a baseline classifier. Here we investigate how it can be improved through feature engineering.

As in the notebook `part_1_spam_classifier`, we will use data from `SMS Spam Collection v. 1` described as:

> a public set of SMS labeled messages that have been collected for mobile phone spam research. It has one collection composed by 5,574 English, real and non-enconded messages, tagged according being legitimate (ham) or spam.

([source](http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/))

#### Load useful libraries and data

In [None]:
import pandas as pd
import numpy as np
import re

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import roc_auc_score

import nltk

# Make sure the required NLTK resources are downloaded
nltk.download('punkt')
nltk.download('punkt_tab')  # recent NLTK releases load the tokenizer data from 'punkt_tab'
nltk.download('stopwords')
nltk.download('wordnet')

from nltk.corpus import stopwords

In [None]:
# Load data
spam_data = pd.read_csv(
    "./data/SMSSpamCollection.txt",
    encoding="utf-8",
    header=None,
    delimiter="\t",
    names=["target", "text"],
)

# Encoding target variable
spam_data["target"] = np.where(spam_data["target"] == "spam", 1, 0)

In [None]:
spam_data.head(20)

In [None]:
spam_data.sample(3)

## Deeper dive into the data

Engineering features from text follows the same logic as with numeric data: we look for measurable properties that differ between the classes.

To get an intuition for which features could add predictive power, let's take a closer look at the data.

In [None]:
print(
    "Examples of spam SMS: \n    {}\n    {}".format(
        spam_data[spam_data.target == 1].sample(1).text.iloc[0],
        spam_data[spam_data.target == 1].sample(1).text.iloc[0],
    )
)
print(
    "\nExamples of non-spam SMS: \n    {}\n    {}\n".format(
        spam_data[spam_data.target == 0].sample(1).text.iloc[0],
        spam_data[spam_data.target == 0].sample(1).text.iloc[0],
    )
)

Spam messages seem to be longer than ham ones...

In [None]:
print("Average message length:")
print("   Spam = {:.0f} characters".format(np.mean([len(x) for x in spam_data[spam_data.target == 1].text])))
print("   Non spam = {:.0f} characters".format(np.mean([len(x) for x in spam_data[spam_data.target == 0].text])))

and they are!

Then, spam messages seem to contain more digits than ham ones. Let's check that.

#### ==> Finding specific characteristics using regular expressions

In [None]:
example = "URGENT This is our 2nd attempt to contact U. Your £900 prize from YESTERDAY is still awaiting collection. To claim CALL NOW 09061702893"
example

In [None]:
re.findall(r'\d', example)

In [None]:
re.findall(r'\D', example)[:10]

In [None]:
print("Average number of digits:")
print("   Spam = {:.1f}".format(np.mean([len(x) for x in list(spam_data[spam_data.target == 1].text.str.findall(r'\d'))])))
print("   Non spam = {:.1f}".format(np.mean([len(x) for x in list(spam_data[spam_data.target == 0].text.str.findall(r'\d'))])))

and they do :)
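The two observations above (message length and number of digits) can be turned into numeric features. The cell below is a minimal sketch in plain Python; `message_features` is a hypothetical helper name, and the two sample strings are made up for illustration.

```python
import re


def message_features(text):
    """Return simple numeric features for one SMS (hypothetical helper)."""
    return {
        "length": len(text),                          # number of characters
        "digit_count": len(re.findall(r"\d", text)),  # number of digit characters
    }


# Made-up examples, one spam-like and one ham-like
spam_like = "To claim CALL NOW 09061702893"
ham_like = "See you at lunch?"

print(message_features(spam_like))
print(message_features(ham_like))
```

On the full DataFrame, the same columns could be built with `spam_data["text"].str.len()` and `spam_data["text"].str.count(r"\d")`.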

### Exploring the spam corpus

In [None]:
spam_corpus = spam_data[spam_data.target == 1].text.copy()
# Join the spam corpus into one long string
spam_corpus = ' '.join(spam_corpus.tolist())
# Split this string into tokens
spam_tokens = nltk.word_tokenize(spam_corpus.lower())

In [None]:
spam_tokens[:10]

Let's have a look at the most frequent words.

In [None]:
dist_spam = nltk.FreqDist(spam_tokens)
dist_spam_sorted = {k: v for k,v in sorted(dist_spam.items(), key=lambda item: item[1], reverse=True)}

In [None]:
[(k, v) for k,v in dist_spam_sorted.items()][:10]

We can see that the most frequent tokens include punctuation (".", ",", "!") as well as very common words ("a", "to", "the"). Such common words are called "stop words"; we will remove them, along with the punctuation.

In [None]:
stop_words = list(set(stopwords.words('english'))) + ['u', 'ur']

In [None]:
stop_words[:10]

In [None]:
spam_tokens = [x for x in spam_tokens if x.isalpha() and x not in stop_words]

In [None]:
dist_spam = nltk.FreqDist(spam_tokens)
dist_spam_sorted = {k: v for k,v in sorted(dist_spam.items(), key=lambda item: item[1], reverse=True)}
[(k, v) for k,v in dist_spam_sorted.items()][:10]

### Exploring the non-spam corpus

In [None]:
non_spam_corpus = spam_data[spam_data.target == 0].text.copy()
non_spam_corpus = ' '.join(non_spam_corpus.tolist())
non_spam_tokens = nltk.word_tokenize(non_spam_corpus.lower())

non_spam_tokens = [x for x in non_spam_tokens if x.isalpha() and x not in stop_words]

dist_non_spam = nltk.FreqDist(non_spam_tokens)
dist_non_spam_sorted = {k: v for k,v in sorted(dist_non_spam.items(), key=lambda item: item[1], reverse=True)}

In [None]:
[(k, v) for k,v in dist_non_spam_sorted.items()][:10]

### Side note on stemming and lemmatization

##### Stemming

In [None]:
input1 = "List listed lists listing listings."
words1 = input1.lower().split(' ')
words1

In [None]:
words1 = nltk.word_tokenize(input1)
words1

In [None]:
porter = nltk.PorterStemmer()
[porter.stem(t) for t in nltk.word_tokenize(input1)]

More information about the Porter Stemmer Algorithm can be found here: https://tartarus.org/martin/PorterStemmer/

##### Lemmatization

In [None]:
WNlemma = nltk.WordNetLemmatizer()

In [None]:
example = 'Multiply the numbers independently and count decimal points then, for the division, push the decimal places like i showed you.'
example

In [None]:
example = nltk.word_tokenize(example.lower())
example

In [None]:
example = set([x for x in example if x.isalpha() and x not in stop_words])
example

In [None]:
sorted([porter.stem(t) for t in example])

In [None]:
sorted([WNlemma.lemmatize(t) for t in example])
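The cells above show both techniques without spelling out the difference: a stemmer strips suffixes by rule and may return something that is not a word, while a lemmatizer looks words up and returns a dictionary form. The sketch below illustrates the contrast with a deliberately naive suffix stripper and a tiny hand-written lookup table. Both `naive_stem` and `TOY_LEMMAS` are hypothetical stand-ins, not the Porter algorithm or WordNet.

```python
SUFFIXES = ("ings", "ing", "ed", "s")  # checked in order, longest first


def naive_stem(word):
    """Strip the first matching suffix; purely rule-based, no dictionary."""
    for suffix in SUFFIXES:
        # Keep at least three characters of the word after stripping
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


# Toy lookup table standing in for a real lexical database
TOY_LEMMAS = {"listings": "listing", "divided": "divide", "places": "place"}


def toy_lemmatize(word):
    """Return the dictionary form if known, else the word unchanged."""
    return TOY_LEMMAS.get(word, word)


for w in ["listings", "divided", "places"]:
    print(w, "| stem:", naive_stem(w), "| lemma:", toy_lemmatize(w))
```

Either technique merges several surface forms into one token, which shrinks the vocabulary the classifier has to learn from.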

## To take home

Based on the analysis above, which features would you add to improve our spam classifier? Does the AUC score improve?
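As a possible starting point, the sketch below shows one way to append engineered columns to a bag-of-words matrix before fitting the model. It is only an illustration of the mechanics: the messages are made up, `add_features` is a hypothetical helper, and the choice of length and digit count as extra columns is an assumption drawn from the exploration above. The AUC on such a tiny toy set is not meaningful.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import roc_auc_score
from sklearn.naive_bayes import MultinomialNB

# Made-up toy messages (1 = spam, 0 = ham), for illustration only
train_texts = [
    "WIN a 900 prize call 09061702893 now",
    "Free entry text 80082 to claim",
    "are we still on for lunch",
    "ok see you later",
]
train_y = np.array([1, 1, 0, 0])
test_texts = ["claim your prize call 09050000000", "see you at lunch"]
test_y = np.array([1, 0])


def add_features(X, texts):
    """Append message length and digit count as two extra sparse columns."""
    extra = np.array([[len(t), sum(c.isdigit() for c in t)] for t in texts])
    return hstack([X, csr_matrix(extra)], format="csr")


# Bag-of-words counts, fitted on the training messages only
vectorizer = CountVectorizer()
X_train = add_features(vectorizer.fit_transform(train_texts), train_texts)
X_test = add_features(vectorizer.transform(test_texts), test_texts)

# Fit the classifier and score the held-out messages
model = MultinomialNB().fit(X_train, train_y)
scores = model.predict_proba(X_test)[:, 1]
print("AUC:", roc_auc_score(test_y, scores))
```

On the real data, the same pattern would be applied after `train_test_split`, comparing the AUC with and without the extra columns.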