In [9]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
sns.set_style("whitegrid")
plt.style.use("fivethirtyeight")

In [10]:
# import and instantiate CountVectorizer (with the default parameters)
from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer()

# learn the 'vocabulary' of the training data (occurs in-place)

# examine the fitted vocabulary

📌 From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):

> In this scheme, features and samples are defined as follows:

> - Each individual token occurrence frequency (normalized or not) is treated as a **feature**.
> - The vector of all the token frequencies for a given document is considered a multivariate **sample**.

> A **corpus of documents** can thus be represented by a matrix with **one row per document** and **one column per token** (e.g. word) occurring in the corpus.

> We call **vectorization** the general process of turning a collection of text documents into numerical feature vectors. This specific strategy (tokenization, counting and normalization) is called the **Bag of Words** or "Bag of n-grams" representation. Documents are described by word occurrences while completely ignoring the relative position information of the words in the document.

📌 From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):

> As most documents will typically use a very small subset of the words used in the corpus, the resulting matrix will have **many feature values that are zeros** (typically more than 99% of them).

> For instance, a collection of 10,000 short text documents (such as emails) will use a vocabulary with a size in the order of 100,000 unique words in total while each document will use 100 to 1000 unique words individually.

> In order to be able to **store such a matrix in memory** but also to **speed up operations**, implementations will typically use a **sparse representation** such as the implementations available in the `scipy.sparse` package.
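
To make the sparsity claim concrete, here is a minimal sketch that builds a tiny count matrix by hand, stores it in `scipy.sparse` CSR format, and measures the fraction of non-zero cells. The toy matrix is invented for illustration; the last lines apply the same ratio to the training matrix shape and stored-element count that this notebook reports in its TF-IDF output.

```python
import numpy as np
from scipy import sparse

# Toy 4-document x 10-token count matrix (made-up values, mostly zeros)
dense = np.zeros((4, 10), dtype=np.int64)
dense[0, 1] = 2
dense[2, 7] = 1
dense[3, 3] = 5

# CSR format stores only the non-zero cells
dtm = sparse.csr_matrix(dense)

# Density = stored non-zero cells / total cells
density = dtm.nnz / (dtm.shape[0] * dtm.shape[1])
print(dtm.nnz, density)

# Same ratio for the 13190 x 244547 training matrix with 1346917 stored elements
notebook_density = 1346917 / (13190 * 244547)
print(f"{notebook_density:.4%}")
```

A dense array of that training shape would need billions of cells, which is why the vectorizer returns a sparse matrix instead.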

## 📋 **Summary:**

> - `vect.fit(train)` **learns the vocabulary** of the training data
> - `vect.transform(train)` uses the **fitted vocabulary** to build a document-term matrix from the training data
> - `vect.transform(test)` uses the **fitted vocabulary** to build a document-term matrix from the testing data (and **ignores tokens** it hasn't seen before)
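
The three summary points can be sketched on a toy corpus. The `train` and `test` lists below are made up for illustration and are not the email data used later in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy documents (not the notebook's dataset)
train = ["call you tonight", "Call me a cab", "please call me PLEASE"]
test = ["please don't call me"]

vect = CountVectorizer()

# fit: learn the vocabulary from the training documents only
vect.fit(train)
vocab = sorted(vect.vocabulary_)
print(vocab)

# transform: build document-term matrices with the fitted vocabulary
train_dtm = vect.transform(train)
test_dtm = vect.transform(test)
print(train_dtm.toarray())
print(test_dtm.toarray())
```

With the default settings, text is lowercased and single-character tokens such as "a" are dropped. The token "don" (from "don't") never appeared in the training data, so it has no column and is silently ignored when the test document is transformed.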

# 💾 Reading a text-based dataset into pandas

In [11]:
data=pd.read_pickle('en_emails_raw.pkl')
data=data[['class','message']].rename(columns={'class':'label'}).drop_duplicates()

In [12]:
# read file into pandas using a relative path
# data = pd.read_csv("/kaggle/input/data-spam-collection-dataset/spam.csv", encoding='latin-1')
# data.dropna(how="any", inplace=True, axis=1)
data.columns = ['label', 'message']

data.head()

Unnamed: 0,label,message
1,spam,\n ...
2,spam,Academic Qualifications available from prestig...
3,ham,Greetings all. This is to verify your subscri...
4,spam,try chauncey may conferred the luscious not co...
5,ham,"It's quiet. Too quiet. Well, how about a str..."


# 🔍 Exploratory Data Analysis (EDA)

In [13]:
data.groupby('label').describe()

Unnamed: 0_level_0,message,message,message,message
Unnamed: 0_level_1,count,unique,top,freq
label,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
ham,12184,12184,Greetings all. This is to verify your subscri...,1
spam,5403,5403,\n ...,1


We have `12184` ham messages and `5403` spam messages.

In [14]:
# convert label to a numerical variable
data['label_num'] = data.label.map({'ham':0, 'spam':1})
data.head()

Unnamed: 0,label,message,label_num
1,spam,\n ...,1
2,spam,Academic Qualifications available from prestig...,1
3,ham,Greetings all. This is to verify your subscri...,0
4,spam,try chauncey may conferred the luscious not co...,1
5,ham,"It's quiet. Too quiet. Well, how about a str...",0


> As we continue our analysis we want to start thinking about the features we are going to use. This goes along with the general idea of feature engineering. The better your domain knowledge of the data, the better your ability to engineer more features from it. Feature engineering is a very large part of spam detection in general.

In [15]:
data['message_len'] = data.message.apply(len)
data.head()

Unnamed: 0,label,message,label_num,message_len
1,spam,\n ...,1,1404
2,spam,Academic Qualifications available from prestig...,1,632
3,ham,Greetings all. This is to verify your subscri...,0,861
4,spam,try chauncey may conferred the luscious not co...,1,909
5,ham,"It's quiet. Too quiet. Well, how about a str...",0,94


# 📑 Text Pre-processing

> The main issue with our data is that it is all in text format (strings). The classification algorithms that we usually use need some sort of numerical feature vector in order to perform the classification task. There are many methods to convert a corpus to a vector format. The simplest is the `bag-of-words` approach, where each unique word in the corpus is assigned one index and each text is described by the counts of its words.


> In this section we'll convert the raw messages (sequence of characters) into vectors (sequences of numbers).

> As a first step, let's write a function that splits a message into its individual words, removes very common words ('the', 'a', etc.), and joins what remains back into a single string. To do this we will take advantage of the `NLTK` library. It is one of the most widely used Python libraries for processing text and has a lot of useful features. We'll only use some of the basic ones here.

> Let's create a function that will process the string in the message column, then we can just use **apply()** in pandas to process all the text in the DataFrame.

> First we remove punctuation. We can take advantage of Python's built-in **string** module, whose `string.punctuation` constant lists the ASCII punctuation characters:

In [16]:
import string
from nltk.corpus import stopwords

def text_process(mess):
    """
    Takes in a string of text, then performs the following:
    1. Remove all punctuation
    2. Remove all stopwords
    3. Return the cleaned text as a single space-separated string
    """
    STOPWORDS = stopwords.words('english') + ['u', 'ü', 'ur', '4', '2', 'im', 'dont', 'doin', 'ure']
    # Check characters to see if they are in punctuation
    nopunc = [char for char in mess if char not in string.punctuation]

    # Join the characters again to form the string.
    nopunc = ''.join(nopunc)
    
    # Now just remove any stopwords
    return ' '.join([word for word in nopunc.split() if word.lower() not in STOPWORDS])
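
For readers who want to try the same cleaning steps without downloading the NLTK corpus, here is a minimal sketch. It is not the function above: `clean_text`, `TOY_STOPWORDS`, and `PUNCT_TABLE` are hypothetical names, and the tiny hand-written stop word set is only a stand-in for NLTK's English list.

```python
import string

# Stand-in stop word list (assumption): a few words instead of NLTK's English list
TOY_STOPWORDS = {"its", "too", "how", "about", "a", "the"}

# Translation table that deletes every ASCII punctuation character
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_text(message):
    """Strip punctuation, drop stop words, and return a space-joined string."""
    no_punct = message.translate(PUNCT_TABLE)
    kept = [word for word in no_punct.split() if word.lower() not in TOY_STOPWORDS]
    return " ".join(kept)


cleaned = clean_text("It's quiet. Too quiet. Well, how about a straw poll?")
print(cleaned)
```

Using a set for the stop words keeps each membership test cheap, which matters when the function is applied to thousands of messages.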

In [17]:
data.head()

Unnamed: 0,label,message,label_num,message_len
1,spam,\n ...,1,1404
2,spam,Academic Qualifications available from prestig...,1,632
3,ham,Greetings all. This is to verify your subscri...,0,861
4,spam,try chauncey may conferred the luscious not co...,1,909
5,ham,"It's quiet. Too quiet. Well, how about a str...",0,94


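> Now let's "tokenize" these messages. Tokenization is the process of converting a normal text string into a list of tokens (the words that we actually want). Note that `text_process` joins the surviving tokens back into one space-separated string, so `clean_msg` holds cleaned text rather than Python lists.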

In [18]:
data['clean_msg'] = data.message.apply(text_process)

data.head()

Unnamed: 0,label,message,label_num,message_len,clean_msg
1,spam,\n ...,1,1404,LUXURY WATCHES BUY ROLEX 219 Rolex Cartier Bvl...
2,spam,Academic Qualifications available from prestig...,1,632,Academic Qualifications available prestigious ...
3,ham,Greetings all. This is to verify your subscri...,0,861,Greetings verify subscription plan9fans list c...
4,spam,try chauncey may conferred the luscious not co...,1,909,try chauncey may conferred luscious continued ...
5,ham,"It's quiet. Too quiet. Well, how about a str...",0,94,quiet quiet Well straw poll many plan9 running...


In [19]:
type(stopwords.words('english'))

list

In [20]:
from collections import Counter

words = data[data.label=='ham'].clean_msg.apply(lambda x: [word.lower() for word in x.split()])
ham_words = Counter()

for msg in words:
    ham_words.update(msg)
    
print(ham_words.most_common(50))

[('would', 11217), ('one', 10590), ('1', 9768), ('use', 9579), ('id', 9096), ('list', 8441), ('received', 8388), ('subject', 8180), ('get', 8161), ('email', 7496), ('like', 7276), ('also', 7233), ('using', 7076), ('date', 6926), ('time', 6909), ('may', 6867), ('dmdx', 6750), ('new', 6567), ('file', 6536), ('message', 6450), ('0', 6151), ('wrote', 6113), ('information', 5919), ('university', 5727), ('could', 5639), ('know', 5564), ('nil', 5423), ('send', 5412), ('thanks', 5385), ('help', 5379), ('problem', 5201), ('board', 5191), ('please', 5141), ('10', 5129), ('2002', 5070), ('20', 5056), ('system', 5014), ('work', 5013), ('need', 4781), ('two', 4653), ('dmdxpsy1psycharizonaedu', 4652), ('mail', 4589), ('0700', 4565), ('see', 4494), ('used', 4405), ('3', 4347), ('available', 4343), ('first', 4329), ('way', 4216), ('make', 4166)]


In [21]:
words = data[data.label=='spam'].clean_msg.apply(lambda x: [word.lower() for word in x.split()])
spam_words = Counter()

for msg in words:
    spam_words.update(msg)
    
print(spam_words.most_common(50))

[('20', 5602), ('e', 3090), ('l', 2747), ('v', 2722), ('r', 2652), ('c', 2348), ('3', 2173), ('n', 2151), ('x', 1891), ('p', 1868), ('us', 1607), ('g', 1416), ('company', 1407), ('000', 1267), ('please', 1252), ('b', 1251), ('z', 1177), ('1', 1148), ('email', 1120), ('website', 1110), ('new', 1085), ('money', 1052), ('h', 1051), ('è', 1037), ('get', 1012), ('like', 988), ('one', 972), ('campaign', 955), ('f', 955), ('e8', 947), ('want', 942), ('ra', 938), ('td', 898), ('info', 882), ('gold', 881), ('hi', 872), ('8', 861), ('account', 836), ('best', 832), ('de', 819), ('product', 800), ('7', 799), ('time', 793), ('hoodia', 784), ('weight', 779), ('q', 777), ('may', 762), ('w', 761), ('also', 735), ('â', 713)]


# 🧮 Vectorization

> Currently, we have the messages as cleaned strings of tokens (no stemming or [lemmatization](http://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html) has been applied), and now we need to convert each of those messages into a vector that scikit-learn's models can work with.

> We'll do that in three steps using the bag-of-words model:

> 1. Count how many times a word occurs in each message (known as term frequency)
> 2. Weigh the counts, so that tokens that are frequent across messages get lower weight (inverse document frequency)
> 3. Normalize the vectors to unit length, to abstract from the original text length (L2 norm)
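
The three steps above can be sketched on a made-up three-message corpus. This assumes the default `TfidfTransformer` settings (smoothed IDF and L2 normalization); the documents are invented for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical toy corpus
docs = ["free money now", "free free offer", "meeting notes attached"]

# Step 1: raw term counts
vect = CountVectorizer()
counts = vect.fit_transform(docs)

# Steps 2 and 3: IDF weighting followed by L2 normalization (the defaults)
tfidf = TfidfTransformer()
weighted = tfidf.fit_transform(counts)

# "free" occurs in 2 of 3 documents and "money" in 1, so "free" gets the lower IDF
idf_free = tfidf.idf_[vect.vocabulary_["free"]]
idf_money = tfidf.idf_[vect.vocabulary_["money"]]
print(idf_free, idf_money)

# Each row of the weighted matrix should have Euclidean length 1
row_norms = np.sqrt(np.asarray(weighted.multiply(weighted).sum(axis=1)).ravel())
print(row_norms)
```

With smoothing enabled, scikit-learn computes the weight of a term as `ln((1 + n) / (1 + df)) + 1`, where `n` is the number of documents and `df` is the number of documents containing the term.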

> Let's begin the first step:

> Each vector will have as many dimensions as there are unique words in the data corpus. We will first use scikit-learn's **CountVectorizer**. This model will convert a collection of text documents to a matrix of token counts.

> We can imagine this as a two-dimensional matrix, where one dimension is the entire vocabulary (one row per word) and the other dimension is the documents, in this case one column per message. (scikit-learn stores the transpose of this picture: one row per document and one column per word.)

> For example:

<table border="1">
<tr>
<th></th> <th>Message 1</th> <th>Message 2</th> <th>...</th> <th>Message N</th> 
</tr>
<tr>
<td><b>Word 1 Count</b></td><td>0</td><td>1</td><td>...</td><td>0</td>
</tr>
<tr>
<td><b>Word 2 Count</b></td><td>0</td><td>0</td><td>...</td><td>0</td>
</tr>
<tr>
<td><b>...</b></td> <td>1</td><td>2</td><td>...</td><td>0</td>
</tr>
<tr>
<td><b>Word N Count</b></td> <td>0</td><td>1</td><td>...</td><td>1</td>
</tr>
</table>


> Since the vocabulary is large and each message uses only a small part of it, we can expect a lot of zero counts. Because of this, scikit-learn will output a [Sparse Matrix](https://en.wikipedia.org/wiki/Sparse_matrix).

In [22]:
# split X and y into training and testing sets 
from sklearn.model_selection import train_test_split

# define X and y (from the `data` DataFrame) for use with CountVectorizer
X = data.clean_msg
y = data.label_num
print(X.shape)
print(y.shape)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(17587,)
(17587,)
(13190,)
(4397,)
(13190,)
(4397,)


> There are a lot of arguments and parameters that can be passed to the CountVectorizer. In this case we keep the defaults, because the messages were already cleaned by our previously defined `text_process` function:

In [23]:
from sklearn.feature_extraction.text import CountVectorizer

# instantiate the vectorizer and learn the training data vocabulary
vect = CountVectorizer()
vect.fit(X_train)

# use the fitted vocabulary to create a document-term matrix
X_train_dtm = vect.transform(X_train)

# equivalently: combine fit and transform into a single step
X_train_dtm = vect.fit_transform(X_train)


# examine the document-term matrix
print(type(X_train_dtm), X_train_dtm.shape)

# transform testing data (using fitted vocabulary) into a document-term matrix
X_test_dtm = vect.transform(X_test)
print(type(X_test_dtm), X_test_dtm.shape)

<class 'scipy.sparse._csr.csr_matrix'> (13190, 244547)
<class 'scipy.sparse._csr.csr_matrix'> (4397, 244547)


In [24]:
from sklearn.feature_extraction.text import TfidfTransformer

tfidf_transformer = TfidfTransformer()
tfidf_transformer.fit(X_train_dtm)
tfidf_transformer.transform(X_train_dtm)

<13190x244547 sparse matrix of type '<class 'numpy.float64'>'
	with 1346917 stored elements in Compressed Sparse Row format>

# 🤖 Building and evaluating a model

> We will use [multinomial Naive Bayes](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html):

> The multinomial Naive Bayes classifier is suitable for classification with **discrete features** (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work.

In [25]:
# import and instantiate a Multinomial Naive Bayes model
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()

In [26]:
# train the model using X_train_dtm (timing it with an IPython "magic command")
%time nb.fit(X_train_dtm, y_train)

CPU times: total: 31.2 ms
Wall time: 38 ms


In [27]:
from sklearn import metrics

# make class predictions for X_test_dtm
y_pred_class = nb.predict(X_test_dtm)

# calculate accuracy of class predictions
print("=======Accuracy Score===========")
print(metrics.accuracy_score(y_test, y_pred_class))

# print the confusion matrix
print("=======Confusion Matrix===========")
metrics.confusion_matrix(y_test, y_pred_class)

# Multinomial Naive Bayes model

0.9515578803729816


array([[3048,   21],
       [ 192, 1136]], dtype=int64)
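
To read the confusion matrix above, the following sketch unpacks its four cells and derives accuracy, precision, and recall for the spam class. It only re-uses the numbers printed above and assumes scikit-learn's convention of rows for true labels and columns for predicted labels.

```python
import numpy as np

# Confusion matrix printed above: rows = true label, columns = predicted label
# (0 = ham, 1 = spam)
cm = np.array([[3048, 21],
               [192, 1136]])

# Flattening row by row gives: true negatives, false positives,
# false negatives, true positives
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)  # share of predicted spam that really is spam
recall = tp / (tp + fn)     # share of actual spam that was caught

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
```

The 21 false positives are ham messages flagged as spam, and the 192 false negatives are spam messages that slipped through, so this model errs mostly on the side of letting spam pass.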

In [28]:
# print message text for false positives (ham incorrectly classified as spam)
# X_test[(y_pred_class==1) & (y_test==0)]
X_test[y_pred_class > y_test]

1105     unsubscribe Octopus traps summers moonspun dre...
6342     please see httpelwwwmediamiteduprojectshandybo...
34865    RGVhciBpcmMtbGlzdC13ZWJAc2tpZG1vcmUuZWR1LCB0aG...
6003     MIAGCSqGSIb3DQEHAqCAMIACAQExCzAJBgUrDgMCGgUAMI...
33013    Dear val heard sweepstakes marykateandashleyco...
36477    Hey Kellie Would mind sending one reminder Pro...
30131    RGVhciBpcmMtbGlzdC13ZWJAc2tpZG1vcmUuZWR1LCB0aG...
31240    nevoie sa fac procura pentru vanzarea unui apa...
16248    SGkhDQoNCkhhcyBhbnlvbmUgaWRlYSBhYm91dCB0aGUgbW...
1167     Hallo Da sind sie endlich die Skripte fuer den...
1508     Appel darticles La place du football dans la v...
32374    EstarE9 fuera de mi oficina hasta el 31 de oct...
32527    RGVhciBpcmMtbGlzdC13ZWJAc2tpZG1vcmUuZWR1LCB0aG...
6340     Nayant pas accE9s E0 un micro E9quipE9 dun lec...
25638    DOCTYPE HTML PUBLIC W3CDTD HTML 40 Transitiona...
16847    VIAGRA LINE Click http216342923viagra viagragu...
20801    hola estoy muy interesada en saber como puedo .

In [29]:
# print message text for false negatives (spam incorrectly classified as ham)
X_test[y_pred_class < y_test]

1525     Hi 20 V P C L V X r e b L v n z L G e R x n c ...
30776    Make relaxed interesting forever Jane sharing ...
19814    Hello contacting let know compact disc replica...
36338    Hail Erectile Dysfunction help site ochhorfand...
34142    see us answered Scarecrow Hornyh74 wanna meets...
                               ...                        
10856    精彩手机铃声图片免无限下载！ 每天都有互联网上最新的动感铃声，每天都有互联网上最新的原创图片...
8302                                    width3D0 height3D0
24575    3c21444f43545950452048544d4c205055424c49432022...
7210     set modern resource http69xanthationinfo nwjh ...
221                                                       
Name: clean_msg, Length: 192, dtype: object

In [30]:
# calculate predicted probabilities for X_test_dtm (poorly calibrated)
y_pred_prob = nb.predict_proba(X_test_dtm)[:, 1]
y_pred_prob

array([7.29245402e-001, 9.99880896e-001, 4.90346662e-149, ...,
       1.00000000e+000, 1.00000000e+000, 2.99519492e-023])

In [31]:
# calculate AUC
metrics.roc_auc_score(y_test, y_pred_prob)

0.9892865695430795

In [32]:
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline

pipe = Pipeline([('bow', CountVectorizer()), 
                 ('tfid', TfidfTransformer()),  
                 ('model', MultinomialNB())])

pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)

# calculate accuracy of class predictions
print("=======Accuracy Score===========")
print(metrics.accuracy_score(y_test, y_pred))

# print the confusion matrix
print("=======Confusion Matrix===========")
metrics.confusion_matrix(y_test, y_pred)

# TfidfTransformer > MultinomialNB

0.8940186490789175


array([[3064,    5],
       [ 461,  867]], dtype=int64)

# 📊 Comparing models

We will compare multinomial Naive Bayes with [logistic regression](http://scikit-learn.org/stable/modules/linear_model.html#logistic-regression):

> Logistic regression, despite its name, is a **linear model for classification** rather than regression. Logistic regression is also known in the literature as logit regression, maximum-entropy classification (MaxEnt) or the log-linear classifier. In this model, the probabilities describing the possible outcomes of a single trial are modeled using a logistic function.
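
As a minimal sketch of what "modeled using a logistic function" means, the example below fits a binary model on made-up one-feature data and checks that applying the sigmoid to the linear score reproduces the class-1 probability. The data and variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy one-feature data set (made up): small values are class 0, large values class 1
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

model = LogisticRegression(solver="liblinear").fit(X, y)

# Linear score w*x + b for each sample
scores = model.decision_function(X)

# Logistic (sigmoid) function applied to the linear score
manual_prob = 1.0 / (1.0 + np.exp(-scores))

# Class-1 probability as reported by the model
model_prob = model.predict_proba(X)[:, 1]

print(np.allclose(manual_prob, model_prob))
```

The model itself is linear in the features; the logistic function only maps the linear score onto the interval from 0 to 1.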

In [33]:
# import and instantiate a logistic regression model
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression(solver='liblinear')

# train the model using X_train_dtm
%time logreg.fit(X_train_dtm, y_train)

CPU times: total: 5 s
Wall time: 4.14 s


In [34]:
# make class predictions for X_test_dtm
y_pred_class = logreg.predict(X_test_dtm)

# calculate predicted probabilities for X_test_dtm (well calibrated)
y_pred_prob = logreg.predict_proba(X_test_dtm)[:, 1]
y_pred_prob

array([7.71725161e-01, 9.27654794e-01, 3.84094284e-14, ...,
       9.89891388e-01, 9.99629260e-01, 6.20172702e-05])

In [35]:
# calculate accuracy of class predictions
print("=======Accuracy Score===========")
print(metrics.accuracy_score(y_test, y_pred_class))

# print the confusion matrix
print("=======Confusion Matrix===========")
print(metrics.confusion_matrix(y_test, y_pred_class))

# calculate AUC
print("=======ROC AUC Score===========")
print(metrics.roc_auc_score(y_test, y_pred_prob))


# LogisticRegression

0.9770297930407096
[[3001   68]
 [  33 1295]]
0.993309994621693


# 🧮 Tuning the vectorizer

Thus far, we have been using the default parameters of [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html):

In [36]:
# show default parameters for CountVectorizer
vect

> 📌 However, the vectorizer is worth tuning, just like a model is worth tuning! Here are a few parameters that you might want to tune:

> - 📌 **stop_words**: string {'english'}, list, or None (default)
    - If 'english', a built-in stop word list for English is used.
    - If a list, that list is assumed to contain stop words, all of which will be removed from the resulting tokens.
    - If None, no stop words will be used.

In [37]:
# remove English stop words
vect = CountVectorizer(stop_words='english')
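
A quick way to see the effect of `stop_words` is to compare the learned vocabulary with and without it. The single document below is made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical one-document corpus
docs = ["this is the prize money"]

# Vocabulary with the default settings (no stop word removal)
default_vocab = sorted(CountVectorizer().fit(docs).vocabulary_)

# Vocabulary after removing scikit-learn's built-in English stop words
filtered_vocab = sorted(CountVectorizer(stop_words="english").fit(docs).vocabulary_)

print(default_vocab)
print(filtered_vocab)
```

Only the content words survive the filter, which shrinks the feature space at the cost of discarding words that might occasionally carry signal.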

> - 📌 **ngram_range**: tuple (min_n, max_n), default=(1, 1)
    - The lower and upper boundary of the range of n-values for different n-grams to be extracted.
    - All values of n such that min_n <= n <= max_n will be used.

In [38]:
# include 1-grams and 2-grams
vect = CountVectorizer(ngram_range=(1, 2))
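
To see what `ngram_range=(1, 2)` adds, this sketch lists the features learned from one made-up message.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical one-message corpus
docs = ["free money now"]

# Extract both single words (1-grams) and adjacent word pairs (2-grams)
vect_ngrams = CountVectorizer(ngram_range=(1, 2))
vect_ngrams.fit(docs)

ngram_vocab = sorted(vect_ngrams.vocabulary_)
print(ngram_vocab)
```

Each pair of adjacent words becomes its own column, so the vocabulary grows quickly on a real corpus; that growth is the main cost of adding 2-grams.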

> - 📌 **max_df**: float in range [0.0, 1.0] or int, default=1.0
    - When building the vocabulary, ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words).
    - If float, the parameter represents a proportion of documents.
    - If integer, the parameter represents an absolute count.

In [39]:
# ignore terms that appear in more than 50% of the documents
vect = CountVectorizer(max_df=0.5)

> - 📌 **min_df**: float in range [0.0, 1.0] or int, default=1
    - When building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold. (This value is also called "cut-off" in the literature.)
    - If float, the parameter represents a proportion of documents.
    - If integer, the parameter represents an absolute count.

In [40]:
# only keep terms that appear in at least 2 documents
vect = CountVectorizer(min_df=2)
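
The document-frequency thresholds are easiest to follow on a tiny corpus. In the made-up example below, "spam" appears in 3 of 4 documents, "offer" and "meeting" in 2 each, and "now" and "lunch" in 1 each.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus of four documents
docs = ["spam spam offer", "spam meeting", "spam offer now", "lunch meeting"]


def learned_vocab(**params):
    """Fit a CountVectorizer with the given parameters and return its sorted vocabulary."""
    return sorted(CountVectorizer(**params).fit(docs).vocabulary_)


# Drop terms found in strictly more than 50% of the documents
vocab_max = learned_vocab(max_df=0.5)

# Drop terms found in fewer than 2 documents
vocab_min = learned_vocab(min_df=2)

# Apply both thresholds together
vocab_both = learned_vocab(max_df=0.5, min_df=2)

print(vocab_max)
print(vocab_min)
print(vocab_both)
```

Document frequency counts documents, not occurrences, so the repeated "spam" in the first document counts once. "offer" sits exactly at 50% and is kept, because `max_df` only removes terms strictly above the threshold.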

> - 📌 **Guidelines for tuning CountVectorizer**:
    - Use your knowledge of the problem and the text, and your understanding of the tuning parameters, to help you decide what parameters to tune and how to tune them.
    - Experiment, and let the data tell you the best approach!
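
One way to "let the data tell you" is a cross-validated grid search over vectorizer settings. The sketch below is only an illustration under stated assumptions: the twelve messages are made up, the grid is deliberately tiny, and the step names `bow` and `model` are arbitrary labels. On the real data you would pass `X_train` and `y_train` instead.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up toy corpus: 6 spam-like messages followed by 6 ham-like messages
texts = [
    "win free money now", "free prize claim now", "cheap pills buy now",
    "claim your free prize", "win cash prize today", "buy cheap watches online",
    "meeting notes attached", "lunch tomorrow at noon", "please review the patch",
    "mailing list digest update", "see you at the meeting", "patch review notes attached",
]
labels = [1] * 6 + [0] * 6

# Vectorizer and classifier chained so both are refit inside every fold
pipe = Pipeline([("bow", CountVectorizer()), ("model", MultinomialNB())])

# Candidate vectorizer settings, addressed as <step name>__<parameter>
param_grid = {
    "bow__ngram_range": [(1, 1), (1, 2)],
    "bow__stop_words": [None, "english"],
}

grid = GridSearchCV(pipe, param_grid, cv=3, scoring="accuracy")
grid.fit(texts, labels)

print(grid.best_params_)
print(grid.best_score_)
```

Putting the vectorizer inside the pipeline matters: the vocabulary is then learned only from the training part of each fold, so the held-out part is treated like genuinely unseen text.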