
# Multinomial Naive Bayes Spam Classifier


In [6]:
import pandas as pd

## Topics

1. Representing text as numerical data
2. Reading a text-based dataset into pandas (1 point)
3. Vectorizing our dataset (1 point)
4. Building and evaluating a model (2 points)
5. Examining a model for further insight (1 point)
6. Tuning the vectorizer (5 points)

## Part 1: Representing text as numerical data

In [7]:
# example text for model training
simple_train = ['call you tonight', 'Call me a "cab"', 'please call me... PLEASE!']

From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):

> Text Analysis is a major application field for machine learning algorithms. However the raw data, a sequence of symbols cannot be fed directly to the algorithms themselves as most of them expect **numerical feature vectors with a fixed size** rather than the **raw text documents with variable length**.

We will use [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) to "convert text into a matrix of token counts":

In [8]:
# import and instantiate CountVectorizer (with the default parameters)
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()

In [9]:
# learn the 'vocabulary' of the training data (occurs in-place)
vect.fit(simple_train)

In [10]:
# examine the fitted vocabulary
vect.get_feature_names_out()

array(['cab', 'call', 'me', 'please', 'tonight', 'you'], dtype=object)

In [11]:
# transform training data into a 'document-term matrix'
simple_train_dtm = vect.transform(simple_train)
simple_train_dtm

<3x6 sparse matrix of type '<class 'numpy.int64'>'
	with 9 stored elements in Compressed Sparse Row format>

In [12]:
# convert sparse matrix to a dense matrix
simple_train_dtm.toarray()

array([[0, 1, 0, 0, 1, 1],
       [1, 1, 1, 0, 0, 0],
       [0, 1, 1, 2, 0, 0]], dtype=int64)

In [13]:
# examine the vocabulary and document-term matrix together
pd.DataFrame(simple_train_dtm.toarray(), columns=vect.get_feature_names_out())

|   | cab | call | me | please | tonight | you |
|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 0 | 1 | 1 |
| 1 | 1 | 1 | 1 | 0 | 0 | 0 |
| 2 | 0 | 1 | 1 | 2 | 0 | 0 |


From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):

> In this scheme, features and samples are defined as follows:

> - Each individual token occurrence frequency (normalized or not) is treated as a **feature**.
> - The vector of all the token frequencies for a given document is considered a multivariate **sample**.

> A **corpus of documents** can thus be represented by a matrix with **one row per document** and **one column per token** (e.g. word) occurring in the corpus.

> We call **vectorization** the general process of turning a collection of text documents into numerical feature vectors. This specific strategy (tokenization, counting and normalization) is called the **Bag of Words** or "Bag of n-grams" representation. Documents are described by word occurrences while completely ignoring the relative position information of the words in the document.

In [14]:
# check the type of the document-term matrix
type(simple_train_dtm)

scipy.sparse._csr.csr_matrix

In [15]:
# examine the sparse matrix contents
print(simple_train_dtm)

  (0, 1)	1
  (0, 4)	1
  (0, 5)	1
  (1, 0)	1
  (1, 1)	1
  (1, 2)	1
  (2, 1)	1
  (2, 2)	1
  (2, 3)	2


From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):

> As most documents will typically use a very small subset of the words used in the corpus, the resulting matrix will have **many feature values that are zeros** (typically more than 99% of them).

> For instance, a collection of 10,000 short text documents (such as emails) will use a vocabulary with a size in the order of 100,000 unique words in total while each document will use 100 to 1000 unique words individually.

> In order to be able to **store such a matrix in memory** but also to **speed up operations**, implementations will typically use a **sparse representation** such as the implementations available in the `scipy.sparse` package.
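
As a rough illustration of sparsity, the sketch below measures how many cells of the small example matrix are actually stored. The variable names `stored` and `density` are chosen for this example only; `nnz` is the SciPy attribute that reports the number of stored entries.

```python
from sklearn.feature_extraction.text import CountVectorizer

simple_train = ['call you tonight', 'Call me a "cab"', 'please call me... PLEASE!']
dtm = CountVectorizer().fit_transform(simple_train)

n_rows, n_cols = dtm.shape
stored = dtm.nnz                      # number of explicitly stored entries
density = stored / (n_rows * n_cols)  # fraction of cells that are stored

print(f'{stored} stored values out of {n_rows * n_cols} cells')
print(f'density = {density:.2f}, sparsity = {1 - density:.2f}')
```

This toy matrix is half full, so the sparse format saves little here. The benefit shows up on real corpora, where the vocabulary is large and each document uses only a small part of it.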

In [16]:
# example text for model testing
simple_test = ["please don't call me"]

In order to **make a prediction**, the new observation must have the **same features as the training observations**, both in number and meaning.

In [17]:
# transform testing data into a document-term matrix (using existing vocabulary)
simple_test_dtm = vect.transform(simple_test)
simple_test_dtm.toarray()

array([[0, 1, 1, 1, 0, 0]], dtype=int64)

In [18]:
# examine the vocabulary and document-term matrix together
pd.DataFrame(simple_test_dtm.toarray(), columns=vect.get_feature_names_out())

|   | cab | call | me | please | tonight | you |
|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 1 | 1 | 0 | 0 |


**Summary:**

- `vect.fit(train)` **learns the vocabulary** of the training data
- `vect.transform(train)` uses the **fitted vocabulary** to build a document-term matrix from the training data
- `vect.transform(test)` uses the **fitted vocabulary** to build a document-term matrix from the testing data (and **ignores tokens** it hasn't seen before)

## Part 2: Reading a text-based dataset into pandas

In [19]:
# read file into pandas using a relative path
path = 'spam_ham.csv'
spam_ham = pd.read_csv(path, header=0, names=['label', 'location','message'])
spam_ham.drop('location', axis=1, inplace=True)

# Drop entries with null values using dropna(inplace=True) on your DataFrame (1 point)
print(spam_ham.shape)           # shape before dropping null values
print(spam_ham.isnull().sum())  # number of null values per column (if any)


spam_ham.dropna(inplace=True)   # drop the rows that contain null values

print(spam_ham.shape)           # shape after dropping null values
print(spam_ham.isnull().sum())  # confirm that no null values remain


In [None]:
# examine the shape
spam_ham.shape

(30974, 2)

In [None]:
# examine the first 10 rows
spam_ham.head(10)

|   | label | message |
|---|---|---|
| 0 | spam | LUXURY WATCHES - BUY YOUR OWN ROLEX FOR ONLY $... |
| 1 | spam | Academic Qualifications available from prestig... |
| 2 | ham | Greetings all. This is to verify your subscrip... |
| 3 | spam | try chauncey may conferred the luscious not co... |
| 4 | ham | It's quiet. Too quiet. Well, how about a straw... |
| 5 | ham | It's working here. I have departed almost tota... |
| 6 | spam | The OIL sector is going crazy. This is our wee... |
| 7 | spam | Little magic. Perfect weekends.http://othxu.rz... |
| 8 | ham | Greetings all. This is a mass acknowledgement ... |
| 9 | spam | Hi, L C P A X V V e I r m a A I v A o b n L A ... |


In [None]:
# examine the class distribution
spam_ham.label.value_counts()

spam    19280
ham     11694
Name: label, dtype: int64

In [None]:
# convert label to a numerical variable
spam_ham['label_num'] = spam_ham.label.map({'ham':0, 'spam':1})

In [None]:
# check that the conversion worked
spam_ham.head(10)

|   | label | message | label_num |
|---|---|---|---|
| 0 | spam | LUXURY WATCHES - BUY YOUR OWN ROLEX FOR ONLY $... | 1 |
| 1 | spam | Academic Qualifications available from prestig... | 1 |
| 2 | ham | Greetings all. This is to verify your subscrip... | 0 |
| 3 | spam | try chauncey may conferred the luscious not co... | 1 |
| 4 | ham | It's quiet. Too quiet. Well, how about a straw... | 0 |
| 5 | ham | It's working here. I have departed almost tota... | 0 |
| 6 | spam | The OIL sector is going crazy. This is our wee... | 1 |
| 7 | spam | Little magic. Perfect weekends.http://othxu.rz... | 1 |
| 8 | ham | Greetings all. This is a mass acknowledgement ... | 0 |
| 9 | spam | Hi, L C P A X V V e I r m a A I v A o b n L A ... | 1 |


In [None]:
# define X and y (from the spam/ham data) for use with CountVectorizer
X = spam_ham.message
y = spam_ham.label_num
print(X.shape)
print(y.shape)

(30974,)
(30974,)


In [None]:
# split X and y into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=427)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(23230,)
(7744,)
(23230,)
(7744,)
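
The sizes above follow from the default `test_size` of 0.25 in `train_test_split`. The sketch below reproduces that arithmetic and then shows, on made-up toy labels, how the optional `stratify` argument keeps the class proportions the same in both splits. The stratified split is an illustration only and was not used in this notebook.

```python
import math

import pandas as pd
from sklearn.model_selection import train_test_split

# Reproduce the split sizes: the test set gets ceil(0.25 * n) rows, the rest go to training
n_samples = 30974
n_test = math.ceil(0.25 * n_samples)
n_train = n_samples - n_test
print(n_train, n_test)

# Made-up toy data: 4 ham (0) and 8 spam (1) messages
toy_y = pd.Series([0] * 4 + [1] * 8)
toy_X = pd.Series([f'message {i}' for i in range(len(toy_y))])

# stratify=toy_y asks for the same ham/spam proportions in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, stratify=toy_y, random_state=427
)
print(y_tr.value_counts().to_dict())
print(y_te.value_counts().to_dict())
```

Stratifying matters most when one class is rare; with the roughly 62/38 split in this dataset, a plain random split is usually close enough.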


## Part 3: Vectorizing our dataset

In [None]:
# instantiate the vectorizer
vect = CountVectorizer()

In [None]:
# learn the vocabulary of the training data
vect.fit(X_train)

In [None]:
# examine the fitted vocabulary
vect.get_feature_names_out()

array(['00', '000', '0000', ..., 'ｔ谷', 'ｗ６２', 'ｙ里様お互いがくつろげるような'],
      dtype=object)

In [None]:
# transform training data into a 'document-term matrix'
X_train_dtm = vect.transform(X_train)

In [None]:
# equivalently: combine fit and transform into a single step (1 point)
X_train_dtm = vect.fit_transform(X_train)

In [None]:
# examine the document-term matrix
X_train_dtm

<23230x161925 sparse matrix of type '<class 'numpy.int64'>'
	with 2305787 stored elements in Compressed Sparse Row format>

In [None]:
# transform testing data (using fitted vocabulary) into a document-term matrix
X_test_dtm = vect.transform(X_test)

## Part 4: Building and evaluating a model

We will use [multinomial Naive Bayes](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html):

> The multinomial Naive Bayes classifier is suitable for classification with **discrete features** (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work.
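
As an aside, the vectorizer and the classifier can be chained so that raw text goes in and predictions come out. The sketch below uses `make_pipeline` on a handful of made-up messages (`toy_messages` and `toy_labels` are invented for this example); it is one possible arrangement, not the approach this notebook follows.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up toy messages: 1 = spam, 0 = ham
toy_messages = [
    'win a free prize now',
    'free money click now',
    'are we meeting for lunch today',
    'see you at the meeting tomorrow',
]
toy_labels = [1, 1, 0, 0]

# The pipeline vectorizes the text, then fits Naive Bayes on the word counts
spam_pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
spam_pipeline.fit(toy_messages, toy_labels)

# Raw text can be passed straight to predict()
prediction = spam_pipeline.predict(['free prize money'])
print(prediction)
```

A pipeline also guarantees that the vocabulary is learned from the training data only, which is the same discipline the separate `fit` and `transform` calls enforce by hand.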

In [None]:
# import and instantiate a Multinomial Naive Bayes model
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()

In [None]:
# train the model using X_train_dtm and the fit() method (1 point)
nb.fit(X_train_dtm, y_train)

In [None]:
# make class predictions for X_test_dtm, store it in y_pred_class (1 point)
y_pred_class = nb.predict(X_test_dtm)

In [None]:
# calculate accuracy of class predictions
from sklearn import metrics
metrics.accuracy_score(y_test, y_pred_class)

0.9847623966942148

In [None]:
# print the confusion matrix
metrics.confusion_matrix(y_test, y_pred_class)

array([[2967,    7],
       [ 111, 4659]], dtype=int64)
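
The four cells of the confusion matrix are enough to recompute the headline metrics by hand. The sketch below copies the counts printed above into a NumPy array (rows are true classes, columns are predicted classes, with spam as the positive class) and derives accuracy, precision, recall, and specificity from them:

```python
import numpy as np

# Counts copied from the confusion matrix above
cm = np.array([[2967, 7],
               [111, 4659]])

# For a binary problem, ravel() yields the cells in this order
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of all messages classified correctly
precision_spam = tp / (tp + fp)   # of messages flagged as spam, share that are spam
recall_spam = tp / (tp + fn)      # of actual spam, share that was caught
specificity = tn / (tn + fp)      # of actual ham, share that was left alone

print(f'accuracy    = {accuracy:.4f}')
print(f'precision   = {precision_spam:.4f}')
print(f'recall      = {recall_spam:.4f}')
print(f'specificity = {specificity:.4f}')
```

Read this way, the model lets 111 spam messages through (false negatives) but wrongly flags only 7 ham messages (false positives), which is usually the preferable direction for a spam filter to err.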

In [None]:
# Print the classification report
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred_class))

              precision    recall  f1-score   support

           0       0.96      1.00      0.98      2974
           1       1.00      0.98      0.99      4770

    accuracy                           0.98      7744
   macro avg       0.98      0.99      0.98      7744
weighted avg       0.99      0.98      0.98      7744



In [None]:
# calculate predicted probabilities for X_test_dtm (poorly calibrated)
y_pred_prob = nb.predict_proba(X_test_dtm)[:, 1]
y_pred_prob

array([1.00000000e+00, 1.00000000e+00, 1.73677105e-31, ...,
       1.00000000e+00, 1.00000000e+00, 7.77424439e-01])

In [None]:
# calculate AUC
metrics.roc_auc_score(y_test, y_pred_prob)

0.9967052681591261

## Part 5: Examining a model for further insight

We will examine our **trained Naive Bayes model** to calculate the approximate **"spamminess" of each token**.

In [None]:
# store the vocabulary of X_train
X_train_tokens = vect.get_feature_names_out()
len(X_train_tokens)

161925

In [None]:
# examine the first 50 tokens
print(X_train_tokens[0:50])

['00' '000' '0000' '000000' '0000000' '00000000' '000000000'
 '00000000000000' '000000000000000000000000000000049999999999999e9'
 '0000000000000000000000000000000500000000000000e9' '0000000000000016666l'
 '0000000000000017d' '00000000000000e' '000000000000received'
 '000000000001received' '0000000001d0' '0000000001l0'
 '0000000010000000004l0' '0000000016' '000000001d0' '00000000message'
 '00000000x' '00000001' '00000001content' '00000001irdecode' '00000004'
 '00000010' '00000010pwm' '00000011' '00000049' '0000005' '0000006hz'
 '000000eb' '000001' '00000100' '00000100shaftencoder' '000001bdaaa0'
 '000001bdb744' '000001bdcaf3' '000001c642d0' '000001c64310'
 '000001c64585' '000001c64615' '000001c64641' '000001c6465f'
 '000001c6468e' '0000020' '0000040' '0000040b' '0000040c']


In [None]:
# examine the last 50 tokens
print(X_train_tokens[-50:])

['３ｄバーチャルｓｅｘメーカー' '４名紹介http' '４月度新規メンバー様応援企画パーティー開催予定' '４月２３日'
 '５月までのようなので興味のある方はお早めに' '５月中週末３人か４人ぐらいで' '５００円分のポイントが完全無料で自動追加されます'
 '６名のみとなります' 'ａ型' 'ａ美様現在未亡人でいらっしゃいます' 'ｂ９６' 'ｆカップ' 'ｇａｌ誌多数掲載' 'ｇｏｏｄ'
 'ｇｒａｎｄｅｅ' 'ｇｒａｎｄｅｅの理念やシステムのご紹介' 'ｇｗこそ出会いのチャンスhttp' 'ｇｗです'
 'ｇｗはこういう女の子と過ごしたいっす' 'ｇｗ特典あり' 'ｈなこと大好きな人ばかり' 'ｈな女の子が多いので' 'ｈな欲望や願望を胸に秘め'
 'ｈにそんなに興味なかったのと少し怖いのもあるため' 'ｈのお相手しただけで' 'ｈゲームメーカーの決定版' 'ｈ度'
 'ｈ目的の出会いも簡単です' 'ｈ８６' 'ｋ村' 'ｍ子様セーリングクルーザーをお持ちで' 'ｍ字開脚オナニーを机の下から盗撮'
 'ｍａｉｌでのサポートは２４時間対応です' 'ｎ藤' 'ｏｌ' 'ｐｃ' 'ｐｃから簡単プロフィール作成' 'ｓクラス専門店'
 'ｓ子様秘密が条件で' 'ｓｅｘを求めている' 'ｓｅｘを求めているのです' 'ｓｍ' 'ｔ165' 'ｔバックは'
 'ｔバックはいていたらおならが左右に分散するのでなんか変な感じですけどね' 'ｔバックを購入しました' 'ｔ島' 'ｔ谷' 'ｗ６２'
 'ｙ里様お互いがくつろげるような']


In [None]:
# Naive Bayes counts the number of times each token appears in each class
nb.feature_count_

array([[2.009e+03, 3.900e+02, 3.760e+02, ..., 0.000e+00, 0.000e+00,
        0.000e+00],
       [3.637e+03, 5.304e+03, 0.000e+00, ..., 2.000e+00, 1.000e+00,
        2.000e+00]])

In [None]:
# rows represent classes, columns represent tokens
nb.feature_count_.shape

(2, 161925)

In [None]:
# number of times each token appears across all HAM messages
ham_token_count = nb.feature_count_[0, :]
ham_token_count

array([2009.,  390.,  376., ...,    0.,    0.,    0.])

In [None]:
# number of times each token appears across all SPAM messages
spam_token_count = nb.feature_count_[1, :]
spam_token_count

array([3.637e+03, 5.304e+03, 0.000e+00, ..., 2.000e+00, 1.000e+00,
       2.000e+00])

In [None]:
# create a DataFrame of tokens with their separate ham and spam counts
tokens = pd.DataFrame({'token':X_train_tokens, 'ham':ham_token_count, 'spam':spam_token_count}).set_index('token')
tokens.head()

| token | ham | spam |
|---|---|---|
| 00 | 2009.0 | 3637.0 |
| 000 | 390.0 | 5304.0 |
| 0000 | 376.0 | 0.0 |
| 000000 | 50.0 | 62.0 |
| 0000000 | 1.0 | 0.0 |


In [None]:
# examine 5 random DataFrame rows
tokens.sample(5, random_state=427)

| token | ham | spam |
|---|---|---|
| bench | 14.0 | 7.0 |
| bitmapsand | 1.0 | 0.0 |
| gosses | 2.0 | 0.0 |
| tome | 4.0 | 9.0 |
| 1388 | 3.0 | 0.0 |


In [None]:
# Naive Bayes counts the number of observations in each class
nb.class_count_

array([ 8720., 14510.])

Before we can calculate the "spamminess" of each token, we need to avoid **dividing by zero** and account for the **class imbalance**.

In [None]:
# add 1 to ham and spam counts to avoid dividing by 0 (1 point)
tokens.ham = tokens.ham + 1
tokens.spam = tokens.spam + 1

tokens.sample(5, random_state=427) # Check if 1 was successfully added.

| token | ham | spam |
|---|---|---|
| bench | 15.0 | 8.0 |
| bitmapsand | 2.0 | 1.0 |
| gosses | 3.0 | 1.0 |
| tome | 5.0 | 10.0 |
| 1388 | 4.0 | 1.0 |


In [None]:
# convert the ham and spam counts into frequencies
tokens['ham'] = tokens.ham / nb.class_count_[0]
tokens['spam'] = tokens.spam / nb.class_count_[1]
tokens.sample(5, random_state=427)

| token | ham | spam |
|---|---|---|
| bench | 0.00172 | 0.000551 |
| bitmapsand | 0.000229 | 6.9e-05 |
| gosses | 0.000344 | 6.9e-05 |
| tome | 0.000573 | 0.000689 |
| 1388 | 0.000459 | 6.9e-05 |


In [None]:
# calculate the ratio of spam-to-ham for each token
tokens['spam_ratio'] = tokens.spam / tokens.ham
tokens.sample(5, random_state=427)

| token | ham | spam | spam_ratio |
|---|---|---|---|
| bench | 0.00172 | 0.000551 | 0.320515 |
| bitmapsand | 0.000229 | 6.9e-05 | 0.300482 |
| gosses | 0.000344 | 6.9e-05 | 0.200322 |
| tome | 0.000573 | 0.000689 | 1.20193 |
| 1388 | 0.000459 | 6.9e-05 | 0.150241 |
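
The three steps above (add 1, divide by the class size, take the ratio) can be collapsed into one small function. This is a sketch that restates the same arithmetic; `spam_ratio` is a helper name introduced here, and the counts plugged in are the raw `bench` and `tome` counts and the class sizes reported by `nb.class_count_` earlier.

```python
def spam_ratio(ham_count, spam_count, n_ham, n_spam):
    """Ratio of smoothed per-class token frequencies (spam over ham)."""
    ham_freq = (ham_count + 1) / n_ham     # +1 avoids dividing by zero later
    spam_freq = (spam_count + 1) / n_spam  # dividing by class size corrects for imbalance
    return spam_freq / ham_freq


# Class sizes taken from nb.class_count_ above
n_ham, n_spam = 8720, 14510

# Raw counts taken from the sampled rows above
bench_ratio = spam_ratio(14, 7, n_ham, n_spam)
tome_ratio = spam_ratio(4, 9, n_ham, n_spam)

print(f'bench: {bench_ratio:.6f}')
print(f'tome:  {tome_ratio:.6f}')
```

A ratio above 1 means the token is relatively more common in spam, and a ratio below 1 means it leans towards ham. Note that this is a hand-rolled approximation for inspection, and it is not the same smoothing that `MultinomialNB` applies internally.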


In [None]:
# examine the DataFrame sorted by spam_ratio
# note: use sort() instead of sort_values() for pandas 0.16.2 and earlier
tokens.sort_values('spam_ratio', ascending=False)

| token | ham | spam | spam_ratio |
|---|---|---|---|
| product_table | 0.000115 | 0.640662 | 5586.569263 |
| 15px | 0.000115 | 0.201447 | 1756.620262 |
| 0px | 0.000459 | 0.726465 | 1583.692626 |
| proms | 0.000115 | 0.160510 | 1399.647140 |
| professionaladobe | 0.000115 | 0.160510 | 1399.647140 |
| ... | ... | ... | ... |
| nodes | 0.163532 | 0.000069 | 0.000421 |
| node | 0.166514 | 0.000069 | 0.000414 |
| handy | 0.167317 | 0.000069 | 0.000412 |
| cert | 0.192775 | 0.000069 | 0.000358 |


In [None]:
# look up the spam_ratio for a given token
tokens.loc['adobe', 'spam_ratio']

42.4499718062778

## Part 6: Tuning the vectorizer (5 points)

Thus far, we have been using the default parameters of [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html):

In [None]:
# show default parameters for CountVectorizer
vect

However, the vectorizer is worth tuning, just like a model is worth tuning! Here are a few parameters that you might want to tune:

- **stop_words:** string {'english'}, list, or None (default)
    - If 'english', a built-in stop word list for English is used.
    - If a list, that list is assumed to contain stop words, all of which will be removed from the resulting tokens.
    - If None, no stop words will be used.

In [None]:
# remove English stop words
vect = CountVectorizer(stop_words='english')

- **ngram_range:** tuple (min_n, max_n), default=(1, 1)
    - The lower and upper boundary of the range of n-values for different n-grams to be extracted.
    - All values of n such that min_n <= n <= max_n will be used.

In [None]:
# include 1-grams and 2-grams
vect = CountVectorizer(ngram_range=(1, 2))

- **max_df:** float in range [0.0, 1.0] or int, default=1.0
    - When building the vocabulary, ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words).
    - If float, the parameter represents a proportion of documents.
    - If integer, the parameter represents an absolute count.

In [None]:
# ignore terms that appear in more than 50% of the documents
vect = CountVectorizer(max_df=0.5)

- **min_df:** float in range [0.0, 1.0] or int, default=1
    - When building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold. (This value is also called "cut-off" in the literature.)
    - If float, the parameter represents a proportion of documents.
    - If integer, the parameter represents an absolute count.

In [None]:
# only keep terms that appear in at least 2 documents
vect = CountVectorizer(min_df=2)

**Guidelines for tuning CountVectorizer:**

- Use your knowledge of the **problem** and the **text**, and your understanding of the **tuning parameters**, to help you decide what parameters to tune and how to tune them.

Tasks:
1. **Experiment**, and let the data tell you the best approach!
2. Try reducing or increasing the number of features to beat the score of the previous model.
    * Can you score above 99.5%? :)

In [None]:
# Write your code here (5 points)
# Note: Previous accuracy score is 0.9847623966942148

vect_v2 = CountVectorizer(stop_words='english', ngram_range=(1, 15), max_df=0.7, min_df=1)

# Vectorize the data.
X_train_dtm_v2 = vect_v2.fit_transform(X_train)
X_test_dtm_v2 = vect_v2.transform(X_test)

# Refit the model on the new features (this replaces the earlier fit of nb).
nb.fit(X_train_dtm_v2, y_train)
y_pred_class_v2 = nb.predict(X_test_dtm_v2)

# Check accuracy.
metrics.accuracy_score(y_test, y_pred_class_v2)

# Increasing the ngram range gives better results but takes a longer time to compute.

0.9945764462809917
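
One way to run this kind of experiment more systematically is a cross-validated grid search over the vectorizer's parameters. The sketch below is an illustration under stated assumptions, not the method used above: it runs on a handful of made-up messages (`toy_messages`, `toy_labels`) so that it finishes instantly, and the step names `'vect'` and `'nb'` are arbitrary labels chosen here.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up toy messages: 1 = spam, 0 = ham
toy_messages = [
    'win a free prize now',
    'free money click here now',
    'claim your free prize today',
    'cheap watches buy now',
    'are we meeting for lunch today',
    'see you at the meeting tomorrow',
    'please review the attached report',
    'lunch meeting moved to friday',
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('nb', MultinomialNB()),
])

# Parameters are addressed as <step name>__<parameter name>
param_grid = {
    'vect__stop_words': [None, 'english'],
    'vect__ngram_range': [(1, 1), (1, 2)],
}

# cv=2 only because the toy data is tiny; more folds are usual on real data
search = GridSearchCV(pipeline, param_grid, cv=2, scoring='accuracy')
search.fit(toy_messages, toy_labels)

print(search.best_params_)
print(search.best_score_)
```

On the real data, `toy_messages` and `toy_labels` would be replaced by `X_train` and `y_train`. Tuning against cross-validation folds of the training set, rather than against the test set as in the cell above, keeps the final test accuracy an honest estimate.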