Logs   
- [2023/03/08]   
  Restart this notebook if you change the scratch library

- [2024/03/22]   
  You do not need to restart this notebook when you change the scratch library

To do:  
- Make the explanation aligns with the simple example from StatQuest with Josh Starmer   
  [Naive Bayes, Clearly Explained](https://www.youtube.com/watch?v=O2L2Uv9pdDA)

In [2]:
import re
import glob
import numpy as np
import requests
import tarfile
import matplotlib.pyplot as plt

from typing import Set, NamedTuple, List, Tuple, Dict, Iterable 
from collections import defaultdict, Counter
from io import BytesIO
from scratch.machine_learning import MachineLearning as ml

In [3]:
plt.rcParams.update(plt.rcParamsDefault)
plt.rcParams.update({
  'font.size': 16,
  'grid.alpha': 0.25})

In [4]:
%load_ext autoreload
%autoreload 2 

## A Realy Dumb Spam Filter

$S$ be the event "the message is spam."   
$B$ be the event "the message contains the word *bitcon*."

Conditional probability that the message is spam conditional on 
containing the word *bitcoin* (using Bayes' theorem)  

$$
  P(S|B) 
    = \frac{P(B|S) P(S)}{ P(B) }
    = \frac{P(B|S) P(S)}{ P(B|S) P(S) + P(B|\neg S) P(\neg S)}
$$

numerator: the probability that a message is spam *and* contains *bitcoin*  
denominator: the probability that a message contains *bitcoin*.

If we have a large collection of messages we know are spam, and a large
collection of messages we know are not spam, then we can easily
estimate $P(B|S)$ and $P(B|\neg S)$. If we further assume that any
message is equally likely to be spam are not spam 
(so that $P(S) = P(\neg S) = 0.5$), then

$$
\begin{align}
  P(S|B) &= \frac{P(B|S) 0.5}{ P(B|S) 0.5 + P(B|\neg S) 0.5} \nonumber \\
         &= \frac{P(B|S)}{P(B|S) + P(B|\neg S)}
\end{align}
$$

If 50% of spam messages have the word *bitcoin*, but  only 1% of nonspam
messages do, then the probability that any given *bitcon*-containing
email is spam is:

$$
  P(S|B) = \frac{0.5}{0.5 + 0.01} = 98\%
$$

Why is this classifier so dumb? because it only uses a word "bitcoin"    
and with a high confidence (98%) if a message containing a word "bitcoin",    
this spam filter predicts it 98% most of the time as a spam message   

## A More Sophisticated Spam Filter

$X_i$ is the event "a message contains the word $w_i$."     
$P(X_i | S)$ is the probability that a spam message contains the $i$-th word     
$P(X_i | \neg S)$ is the probability that a nonspam message contains the $i$-th word.

The key to Naive Bayes is making the (big) assumption that the presences   
(or absences) of each word are *independent* of one another, *conditional* on   
a message being spam or not.

Intuitvely, this assumption means that knowing whether a certain spam message   
contains the word **bitcon** gives us **no information** about whether the same   
message contains the word **rolex**.

In math, we can write

$$
  P(X_1 = x_1, \ldots, X_n = x_n | S) 
    = P(X_1 = x_1 | S) \times \cdots \times P(X_n = x_n | S)
$$


Multiplying many probabilities will give raise a problem of *underflow*   
(not _overflow_, because the probability is less than 1).  More friendly approach   
to multiply probabilities, is to use the identities $p_i = \exp\{\log(p_i)\}$.   
Then we can transform multiplication into addition

$$
  p_i \times \cdots \times p_n 
    = \exp\{\log(p_1) + \ldots + \log(p_n)\}
$$

Only one problem left to use the above formula. Imagine that in our training set    
the vocabulary word _data_ only occurs in nonspam message. Then we would estimate   
$P(\textrm{data}|S) = 0$. The result is that our Naive Bayes classifier would always  
assign spam probability 0 to _any_message containing the word _data_, even  
a message like "data on free bitcon and authentic rolex watches". To avoid this   
problem, we need some kind of smoothing, which is a concept of _pseducount_ enter  
the game.

*pseudocount* is a way to avoid an extreme condition (probability zero) when   
an event doesn't  occur given another event happens, but it occurs given
another even does not happe or vice versa.  

*pseudocount* $k$ can be formulated like this
$$
  P(X_i|S) = \frac{k + \text{number of spams containing } w_i}{2k + \text{number of spams}}
$$
We also do the same for $P(X_i | \neg S)$.   
When computing the spam probabilities for the $i$-th word, we assume we also saw  
$k$ additional nonspams containing the word and $k$ additional nonspams not   
containing the word.

In [6]:
seed = 24_03_27
rng = np.random.default_rng(seed)
p_i = rng.random(100)

In [7]:
np.prod(p_i)

2.0616069349833837e-42

In [10]:
np.sum(np.log(p_i))

-95.98508816151549

In [8]:
np.exp(np.log(2))

2.0

## Implementation

In [4]:
# Define a tokenization
def tokenize(text: str) -> Set[str]:
  text = text.lower()                         # convert to lowercase
  all_words = re.findall("[a-z0-9]+", text)   # extract the words, and
  return set(all_words)


tokenize("Data Science is science")

{'data', 'is', 'science'}

In [5]:
# Define a type of the training data
class Message(NamedTuple):
  text: str
  is_spam: bool

We name the nonspam emails has *ham* emails. We also make a class
for the classifier

In [6]:
class NaiveBayesClassifier:
  def __init__(self, k: float = 0.5) -> None:
    self.k = k    # smoothing factor / pseudocount

    self.tokens: Set[str] = set()
    self.token_spam_counts: Dict[str, int] = defaultdict(int)
    self.token_ham_counts: Dict[str, int] = defaultdict(int)
    self.spam_messages = self.ham_messages = 0

  
  def train(self, messages: Iterable[Message]) -> None:
    for message in messages:
      # Increment message counts
      if message.is_spam:
        self.spam_messages += 1
      else:
        self.ham_messages += 1

      # Increment word counts
      for token in tokenize(message.text):
        self.tokens.add(token)
        if message.is_spam:
          self.token_spam_counts[token] += 1
        else:
          self.token_ham_counts[token] += 1


  def _probabilities(self, token: str) -> Tuple[float, float]:
    """returns P(token | spam) and P(token | ham)""" 
    spam = self.token_spam_counts[token]
    ham = self.token_ham_counts[token]

    p_token_spam = (spam + self.k) / (self.spam_messages + 2 * self.k)
    p_token_ham = (ham + self.k) / (self.ham_messages + 2 * self.k)

    return p_token_spam, p_token_ham


  def predict(self, text: str) -> float:
    text_tokens = tokenize(text)
    log_prob_if_spam = 0.0
    log_prob_if_ham = 0.0

    # Iterate through each word in our vocabulary
    for token in self.tokens:
      prob_if_spam, prob_if_ham = self._probabilities(token)

      # If `token` appears in the message,
      # add the log probability of `seeing`` it
      if token in text_tokens:
        log_prob_if_spam += np.log(prob_if_spam)
        log_prob_if_ham += np.log(prob_if_ham)

      # Otherwise add the log probability of `not seeing` it,
      # which is log(1 - probability of seeing it)
      else:
        log_prob_if_spam += np.log(1.0 - prob_if_spam)
        log_prob_if_ham += np.log(1.0 - prob_if_ham)

    
    prob_if_spam = np.exp(log_prob_if_spam)
    prob_if_ham = np.exp(log_prob_if_ham)

    return prob_if_spam / (prob_if_spam + prob_if_ham)

## Testing Our Model

Test the model to some unit tests

In [7]:
[["spam rules",  True],
 [ "ham rules", False],
 [ "hello ham", False]]

[['spam rules', True], ['ham rules', False], ['hello ham', False]]

In [11]:
messages = [Message("spam rules", True),
            Message("ham rules", False),
            Message("hello ham", False)]

model = NaiveBayesClassifier(k=0.5)
model.train(messages)

In [12]:
messages

[Message(text='spam rules', is_spam=True),
 Message(text='ham rules', is_spam=False),
 Message(text='hello ham', is_spam=False)]

Let's check that it got the counts right

In [14]:
print("vocabularies", model.tokens)
print("spam", model.spam_messages)
print("ham", model.ham_messages)
print("spam", model.token_spam_counts)
print("ham", model.token_ham_counts)

vocabularies {'ham', 'rules', 'hello', 'spam'}
spam 1
ham 2
spam defaultdict(<class 'int'>, {'rules': 1, 'spam': 1})
ham defaultdict(<class 'int'>, {'ham': 2, 'rules': 1, 'hello': 1})


Let's make a prediction. We also (laboriously) go through our Naive Bayes logic by hand.   
In here, we set $k = 0.5$

$$
  P(X_i|S) = \frac{k + \text{number of spams containing } w_i}{2k + \text{number of spams}} \\[12pt]
  P(X_i|\neg S) = \frac{k + \text{number of ham containing } w_i}{2k + \text{number of hams}}
$$

In [10]:
text = "hello spam"

# test for all vocabularies: {"spam", "ham", "rules", "hello"}
probs_if_spam = [
  (1 + 0.5) / (1 + 2 * 0.5),          # "spam"   (present in "text")
  1 - (0 + 0.5) / (1 + 2 * 0.5),      # "ham"    (not present in "text")
  1 - (1 + 0.5) / (1 + 2 * 0.5),      # "rules"  (not present in "text")   # number of spams with "rules"
  (0 + 0.5) / (1 + 2 * 0.5)           # "hello"  (present)  # number of spams with "hello" is 0
]

probs_if_ham = [
  (0 + 0.5) / (2 + 2 * 0.5),          # "spam"   (present)
  1 - (2 + 0.5) / (2 + 2 * 0.5),      # "ham"    (not present)
  1 - (1 + 0.5) / (2 + 2 * 0.5),      # "rules"  (not present)
  (1 + 0.5) / (2 + 2 * 0.5)           # "hello"  (present)
]

p_if_spam = np.exp(sum(np.log(p) for p in probs_if_spam))
p_if_ham = np.exp(sum(np.log(p) for p in probs_if_ham))

print(model.predict(text))
print(p_if_spam / (p_if_spam + p_if_ham))

0.8350515463917525
0.8350515463917525


## Using Our Model 

Download and unpack spam dataset from [SpamAssassin public corpus](https://spamassassin.apache.org/old/publiccorpus/)

In [11]:
BASE_URL = "https://spamassassin.apache.org/old/publiccorpus"
FILES = ["20021010_easy_ham.tar.bz2",
         "20021010_hard_ham.tar.bz2",
         "20021010_spam.tar.bz2"]

# This is where the data will end up
# in /spam, /easy_ham, and /hard_ham subdirectories.
# Change this to where you want the data.
OUTPUT_DIR = 'spam_data'

for filename in FILES:
  # Use requests to get the file contents at each URL
  content = requests.get(f"{BASE_URL}/{filename}").content 

  # Wrap the in-memory bytes so we can use them as a "file."
  fin = BytesIO(content)

  # And extract all the files to the specified output dir.
  with tarfile.open(fileobj=fin, mode="r:bz2") as tf:
    tf.extractall(OUTPUT_DIR)

To keep things *really* simple, we'll just look at the subject lines
each email.

In [12]:
# modify the path to whatever you've put the files
path = "spam_data/*/*"

data: List[Message] = []

# -- computational time 32 secs
# glob.glob returns every filename that matches the wildcarded path
for filename in glob.glob(path):
  is_spam = "ham" not in filename

  # There are some garbage characters in the emails; the errors='ignore'
  # skips them instead of raising an exception.
  with open(filename, errors='ignore') as email_file:
    for line in email_file:
      if line.startswith("Subject:"):
        subject = line.lstrip("Subject: ")
        data.append(Message(subject, is_spam))
        break   # done with this file

Build the dataset by splitting into training and testing.
After that we ready to train the dataset with the classifier

In [13]:
seed = 2023_04_19
rng = np.random.default_rng(seed)
train_messages, test_messages = ml.split_data(data, 0.75, rng)

model = NaiveBayesClassifier()
model.train(train_messages)

Let's generate some predictions and check how the model does

In [14]:
predictions = [(message, model.predict(message.text))
               for message in test_messages]

# Assume that spam_probability > 0.5 corresponds to spam prediction
# and count the combinations of (actual is_spam, predicted is_spam)
confusion_matrix = Counter((message.is_spam, spam_probability > 0.5)
                           for message, spam_probability in predictions)

print(confusion_matrix)

Counter({(False, False): 676, (True, True): 78, (True, False): 41, (False, True): 30})


In [15]:
print(f"spam classified as `spam` (true positive): {confusion_matrix[True, True]:>3d}")
print(f"ham classified as `spam` (false positive): {confusion_matrix[False, True]:>3d}")
print(f"ham classified as `ham`   (true negative): {confusion_matrix[False, False]:>3d}")
print(f"spam classified as `ham` (false negative): {confusion_matrix[True, False]:>3d}")


spam classified as `spam` (true positive):  78
ham classified as `spam` (false positive):  30
ham classified as `ham`   (true negative): 676
spam classified as `ham` (false negative):  41


In [16]:
true_positive = confusion_matrix[True, True]
false_positive = confusion_matrix[False, True]
false_negative = confusion_matrix[True, False]

print(f"Precision: {true_positive / (true_positive + false_positive):.2f}")
print(f"Recall: {true_positive / (true_positive + false_negative):.2f}")

Precision: 0.72
Recall: 0.66


In [17]:
def p_spam_given_token(token: str, model: NaiveBayesClassifier) -> float:
  # We probabily shouldn't call private methods, but it's for a good cause
  prob_if_spam, prob_if_ham = model._probabilities(token)

  return prob_if_spam / (prob_if_spam + prob_if_ham)


words = sorted(model.tokens, key=lambda t: p_spam_given_token(t, model))

print(f"spammiest_words {words[-10:]}")
print(f"hammiest_words {words[:10]}")

spammiest_words ['needed', 'assistance', 'mortgage', 'attn', 'clearance', 'sale', 'systemworks', 'money', 'rates', 'adv']
hammiest_words ['spambayes', 'users', 'was', 'razor', 'zzzzteana', 'sadev', 'apt', 'ouch', 'perl', 'bliss']


A better improvement of spam filter above
1. Include message body, not only message subject line
2. Accept a token that has occurrence above some threshold
3. Similarity word should be reduced to its basic form (do stemming).
   Popular algorithm using Porter stemmer algorithm
4. Try another feature instead an event of "message contains word $w_i$."
   We can improve by adding number as a feature.
   More complex architecture, of course, using deep learing. 
   We will save it for another time.