#### Let's create a simple function to tokenise messages into distinct words. We'll first convert each message to lowercase, then use `re.findall` to extract "words" consisting of lowercase letters, digits and apostrophes. Finally, we'll use `set` to keep just the distinct words:

In [2]:
from typing import Set
import re

def tokenize(text: str) -> Set[str]:
    text = text.lower()
    all_words = re.findall("[a-z0-9']+", text)
    return set(all_words)

print(tokenize("Data science is science"))

{'is', 'science', 'data'}
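
#### As a further illustration of the steps above, here is a minimal sketch showing how the same pattern treats punctuation, digits and apostrophes. The sample sentence is made up for this example, and the output is sorted only because set ordering is arbitrary:

```python
import re
from typing import Set

def tokenize(text: str) -> Set[str]:
    """Lowercase the text and return its distinct word-like tokens."""
    return set(re.findall("[a-z0-9']+", text.lower()))

# Punctuation and spaces act as separators; apostrophes and digits are kept.
# "WIN" and "win" collapse into one token after lowercasing.
tokens = tokenize("Don't miss out: WIN 100 prizes, win now!")
print(sorted(tokens))  # sorted for a stable display order
```

#### Note that "don't" survives as a single token, while the colon, comma and exclamation mark are dropped.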


#### We'll also define a type for our training data:

In [3]:
from typing import NamedTuple

class Message(NamedTuple):
    text: str
    is_spam: bool
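
#### A short sketch of how this type might be used. The two messages below are hypothetical examples, not taken from any dataset:

```python
from typing import NamedTuple

class Message(NamedTuple):
    text: str
    is_spam: bool

# Hypothetical example messages
messages = [
    Message("win a free prize now", is_spam=True),
    Message("lunch at noon?", is_spam=False),
]

# Fields are accessible by name, and instances unpack like ordinary tuples
first = messages[0]
text, is_spam = first
spam_count = sum(m.is_spam for m in messages)  # True counts as 1
print(first.text, is_spam, spam_count)
```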

#### The classifier's constructor takes just one parameter, the pseudocount `k` to use when computing probabilities. It also initialises an empty set of tokens, counters to track how often each token is seen in spam and non-spam (ham) messages, and counts of how many spam and non-spam messages it was trained on:

In [5]:
from typing import List, Tuple, Dict, Iterable
import math
from collections import defaultdict

class NaiveBayesClassifier:
    def __init__(self, k: float = 0.5) -> None:
        self.k = k  # smoothing factor
        self.tokens: Set[str] = set()
        self.tokens_spam_counts: Dict[str, int] = defaultdict(int)
        self.tokens_ham_counts: Dict[str, int] = defaultdict(int)
        self.spam_messages = self.ham_messages = 0
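
#### To see what the pseudocount `k` is for, here is a minimal sketch of one common form of additive smoothing. The probability code is not shown in this section, so the formula and the helper name `smoothed_probability` are assumptions made for illustration, not necessarily what the classifier uses:

```python
def smoothed_probability(token_count: int, message_count: int, k: float = 0.5) -> float:
    """Estimate P(token | class) with additive smoothing (assumed formula).

    Adds k to the numerator and 2k to the denominator, as if the class had
    seen k extra messages containing the token and k extra without it.
    """
    return (token_count + k) / (message_count + 2 * k)

# A token never seen in 10 spam messages still gets a nonzero probability
unseen = smoothed_probability(0, 10)  # (0 + 0.5) / (10 + 1)
common = smoothed_probability(9, 10)  # (9 + 0.5) / (10 + 1)
print(unseen, common)
```

#### Without smoothing, a token that never appeared in one class would get probability zero for that class, which would wipe out the evidence from every other token in the message.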


#### Next, we'll give it a method to train it on a collection of messages. For each message, we first increment `spam_messages` or `ham_messages`. Then we tokenize the message text, and for each token we increment `tokens_spam_counts` or `tokens_ham_counts` based on the message type:

In [6]:
    def train(self, messages: Iterable[Message]) -> None:
        for message in messages:
            # Increment message counts
            if message.is_spam:
                self.spam_messages += 1
            else:
                self.ham_messages += 1
            # Increment word counts
            for token in tokenize(message.text):
                self.tokens.add(token)
                if message.is_spam:
                    self.tokens_spam_counts[token] += 1
                else:
                    self.tokens_ham_counts[token] += 1
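
#### To check that training behaves as described, here is a self-contained sketch that repeats the definitions above and trains on a hypothetical three-message toy set:

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Set

def tokenize(text: str) -> Set[str]:
    return set(re.findall("[a-z0-9']+", text.lower()))

class Message(NamedTuple):
    text: str
    is_spam: bool

class NaiveBayesClassifier:
    def __init__(self, k: float = 0.5) -> None:
        self.k = k  # smoothing factor
        self.tokens: Set[str] = set()
        self.tokens_spam_counts: Dict[str, int] = defaultdict(int)
        self.tokens_ham_counts: Dict[str, int] = defaultdict(int)
        self.spam_messages = self.ham_messages = 0

    def train(self, messages: Iterable[Message]) -> None:
        for message in messages:
            # Increment message counts
            if message.is_spam:
                self.spam_messages += 1
            else:
                self.ham_messages += 1
            # Increment word counts
            for token in tokenize(message.text):
                self.tokens.add(token)
                if message.is_spam:
                    self.tokens_spam_counts[token] += 1
                else:
                    self.tokens_ham_counts[token] += 1

# Hypothetical toy training set: one spam message, two ham messages
model = NaiveBayesClassifier(k=0.5)
model.train([
    Message("spam rules", is_spam=True),
    Message("ham rules", is_spam=False),
    Message("hello ham", is_spam=False),
])

print(sorted(model.tokens))
print(model.spam_messages, model.ham_messages)
print(dict(model.tokens_spam_counts))
print(dict(model.tokens_ham_counts))
```

#### Because `tokenize` returns a set, each token is counted at most once per message, so the counters record the number of messages containing a token rather than its total number of occurrences.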