# Naive Bayes

`P(S∣B)=[P(B∣S)P(S)]/[P(B∣¬S)P(¬S)+P(B∣S)P(S)]​`                            

The numerator is probability that a message is span and contains bitcoin, while the denominator is just the probability that All emails that contain “Bitcoin”, whether spam or not.

S = spam, B = Bitcoin and ¬S = Not Spam

Now imagine checking many words like:

 “bitcoin”, “rolex”, “free”, “offer”, “win”, etc.

Each word becomes a clue.

The question becomes:

 If an email contains several spam-like words, how likely is it spam?

 We define:

Xi = event that the message contains word i

Example:

X₁ → contains “bitcoin”

X₂ → contains “rolex”

We estimate:

P(Xi | S) → chance a spam email contains that word

P(Xi | ¬S) → chance a normal email contains that word

These are learned from past emails.

The KEY Assumption (Why it’s called Naive)

Naive Bayes assumes:

 Words behave independently.

Meaning:

If you already know the email has “bitcoin”, it tells you NOTHING about whether it also has “rolex”.

Here Independence is instead of calculating joint probability, we simply multiply:

`P(X1​,X2​,...,Xn​∣S)=P(X1​∣S)×P(X2​∣S)×...×P(Xn​∣S)`

Probability spam contains ALL these words = multiply the individual probabilities.


#### Underflow 

In this computers don't deal with floating-point numbers too close to 0.

example:
0.01 × 0.02 × 0.03 × 0.01 ...

after too many multiplication it becomes:
0.0000000000000003

Computers struggle with numbers this tiny.

###### To deal with this use :
`log(a × b) = log(a) + log(b)`
so instead multiplying `p1 × p2 × p3` 
We compute: `log(p1) + log(p2) + log(p3)`
To convert back use exp as: `exp(log(p1)+....log(pn)) = p1+......+pn`

## Smoothing
If spam word that is 'data' occurs in non spam messages during training then p(data|S) = 0. The result is that our Naive Bayes Classifier would always assign spam probability 0 to any message containing the word 'data'.

To avoid this we use SMOOTHING.
To fix this, we pretend we saw every word at least a few times.<br>
`P(Xi​∣S)= (k+number of spam containing word)/(2k+total spam)​` <br>
Here, usually k = 1, k is pseudecount, estimates the probaility of seeing the word in spam message


In [19]:
# Implementation 
"""
Tokenization in strings is the process of breaking a continuous text or string into smaller, meaningful units called "tokens" (such as words, numbers, or symbols) using defined delimiters like spaces, commas, or special characters.
"""
from typing import Set
import re
def tokenize(text:str) -> Set[str]:
    text = text.lower()       #convert to lowercase
    all_words = re.findall("[a-z0-9]+",text)      #extract the words
    return set(all_words)

assert tokenize("Data Science is Science") == {"data","science","is"}


# Defining our training data
from typing import NamedTuple

class Message(NamedTuple):
    text:str
    is_spam:bool


from typing import List, Tuple, Dict, Iterable
import math
from collections import defaultdict

class NaiveBayesClassifier:
    def __init__(self,k:float=0.5)->None:
        self.k = k       # Smoothing Factor

        self.tokens: Set[str] = set()
        self.token_spam_counts:Dict[str,int] = defaultdict(int)
        self.token_ham_counts: Dict[str,int] = defaultdict(int)
        self.spam_messages = self.ham_messages = 0                 # ham_messages = non-spam_messages

    def train(self,messages:Iterable[Message])->None:
        for message in messages:
            # Increment message counts
            if message.is_spam:
                self.spam_messages +=1
            else:
                self.ham_messages += 1

            # Increment word counts
            for token in tokenize(message.text):
                self.tokens.add(token)
                if message.is_spam:
                    self.token_spam_counts[token] += 1
                else:
                    self.token_ham_counts[token] += 1
                 
#Now we have to predict P(spam|token). To apply Bayes's theorem we need to know P(token|spam) and P(token|ham) for each   token in vocabulary. We will create 'private' helper function to compute this 

    def _probabilities(self,token:str) -> Tuple[float,float]:
        """ returns P(token/spam) and P(token/ham) """
        spam = self.token_spam_counts[token]
        ham = self.token_ham_counts[token]

        p_token_spam = (spam + self.k) / (self.spam_messages + 2 * self.k)
        p_token_ham = (ham + self.k) / (self.ham_messages + 2 * self.k)
        return p_token_spam, p_token_ham

# Predict method
    def predict(self,text:str)-> float:
        text_tokens = tokenize(text)
        log_prob_if_spam = log_prob_if_ham = 0.0
        
        # Iterate through each word in our vocabulary
        for token in self.tokens:
            prob_if_spam, prob_if_ham = self._probabilities(token)
            # If *token* appears in the message,
            # add the log probability of seeing it 
            if token in text_tokens:
                log_prob_if_spam += math.log(prob_if_spam)
                log_prob_if_ham += math.log(prob_if_ham)

            # Otherwise add the log probability of _not_ seeing it,
            # which is log(1-probability of seeing it)
            else:
                log_prob_if_spam += math.log(1-prob_if_spam)
                log_prob_if_ham += math.log(1-prob_if_ham)

        prob_if_spam = math.exp(log_prob_if_spam)
        prob_if_ham = math.exp(log_prob_if_ham)
        return prob_if_spam/(prob_if_spam+prob_if_ham)

## Testing Our Model

In [None]:
messages= [Message("spam rules", is_spam=True),
           Message("ham rules", is_spam=False),
           Message("hello ham", is_spam=False)]

model = NaiveBayesClassifier(k=0.5)
model.train(messages)

assert model.tokens == {"spam","ham","hello", "rules"}
assert model.spam_messages == 1
assert model.ham_messages == 2
assert model.token_spam_counts == {"spam":1,"rules":1}
assert model.token_ham_counts == {"ham":2,"rules":1,"hello":1}

## Using Our Model

In [None]:
from io import BytesIO     # so we can treat bytes as file
import requests
import tarfile              # are in .tar.bz format

BASE_URL = "https://spamassassin.apache.org/old/publiccorpus"
FILES = ["20021010_easy_ham.tar.bz2", 
         "20021010_hard_ham.tar.bz2", 
         "20021010_spam.tar.bz2"] 

OUTPUT_DIR = 'spam_data'
for filename in FILES:
    content = requests.get(f"{BASE_URL}/{filename}").content
    
    # Wrap the in-memory bytes so we can use them as a "file." 
    fin = BytesIO(content)

    # And extract all the files to the specified output dir.
    with tarfile.open(fileobj=fin,mode='r:bz2') as tf:
        tf.extractall(OUTPUT_DIR)

  tf.extractall(OUTPUT_DIR)


In [None]:
""" We look through the files, they all seem to start with 'Subject'. """

# modify the path to whenever you've put the files
import glob, re

path = "spam_data/*/*"

data:List[Message] = []

# glob.glob returns every filename that matches the wildcarded path
for filename in glob.glob(path):
    is_spam = "ham" not in filename

    # There are some garbage characters in the emails; the errors='ignore'
    # skips them instead of raising an exception
    with open(filename, errors='ignore') as email_file:
        for line in email_file:
            if line.startswith("Subject:"):
                subject = line.lstrip("Subject: ")
                data.append(Message(subject, is_spam))
                break 


# Now split the data
import random
from typing import TypeVar
X = TypeVar('X')

def split_data(data:List[X],prob:int)-> List[X]:
    data = data[:]
    cut = int(len(data)*prob)
    return data[:cut], data[cut:]

random.seed(0)
train_messages, test_messages = split_data(data,0.75)

model = NaiveBayesClassifier()
model.train(train_messages)

# Let's generate some predictions and check the model
from collections import Counter

predictions = [(message,model.predict(message.text))
               for message in test_messages] 

confusion_matrix = Counter((message.is_spam, spam_probability>0.5)
                           for message,spam_probability in predictions)

print(confusion_matrix)                             

Counter({(True, False): 500, (False, False): 325})


In [None]:
def p_spam_given_token(token:str, model:NaiveBayesClassifier)->float:
    prob_if_spam, prob_if_ham = model._probabilities(token)

    return prob_if_spam/(prob_if_spam+prob_if_ham)

words = sorted(model.tokens,key=lambda t:p_spam_given_token(t,model)) 
print("Spammiest_words",words[-10:])
print('hammiest_words',words[:10])

Spammiest_words ['done', 'scissors', 'outdoors', 'mice', 'passing', 'strength', 'epidemic', 'stepfather', 'curling', 'est']
hammiest_words ['re', 'the', 'for', 'of', 'to', 'in', 'a', 'satalk', 'spambayes', 's']
