## Lesson 1

1\. Describe in writing what assumption is made when a naive Bayes classifier is created. Why is the classifier naive?

The naive Bayes classifier assumes that the features are conditionally independent given the class, which means that $p(F_{i} | C, F_{j}) = p(F_{i} | C)$ for $i \neq j$.

This makes it easy to see why the classifier is called naive: real features are almost never fully independent (words in a message, for example, tend to occur together). The assumption is unrealistic, but the classifier still works well in practice, so let it be used :)
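
To make the role of the assumption explicit, here is a short derivation sketch in the notation used above. The number of features $n$ is a symbol introduced here for illustration. Bayes' rule gives the first step, and applying the independence assumption to every conditioning feature gives the second:

$$
p(C | F_{1}, \dots, F_{n}) \propto p(C) \, p(F_{1}, \dots, F_{n} | C) = p(C) \prod_{i=1}^{n} p(F_{i} | C)
$$

Without the assumption the joint term would need a probability for every combination of feature values. With it, only one conditional probability per feature and class has to be estimated.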

3\. Make a copy of the naive Bayes classifier that we used above to create a spam filter and try to improve its performance.
Split the data set into training, validation and test data. Select the best model using the validation data and then compute your final score on the test data. To improve the model, the whole message content can, for example, be taken into account instead of the subject only. The lengths of the tokens that are taken into account can also be varied. It might also be interesting to split the messages into bigrams: pairs of words that follow one another. And so on.
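
Two of the ideas from the task can be sketched with the standard library only. The helper names `three_way_split` and `word_bigrams`, their parameters and the toy data are all made up for this illustration; this is one possible way to do it, not the intended solution.

```python
import random
import re

def three_way_split(items, labels, p_val=0.1, p_test=0.1, seed=0):
    """Shuffle paired data and split it into train, validation and test parts."""
    idx = list(range(len(items)))
    random.Random(seed).shuffle(idx)          # reproducible shuffle
    n_test = round(p_test * len(idx))
    n_val = round(p_val * len(idx))
    parts = {
        "train": idx[n_test + n_val:],
        "val": idx[n_test:n_test + n_val],
        "test": idx[:n_test],
    }
    return {name: ([items[i] for i in ids], [labels[i] for i in ids])
            for name, ids in parts.items()}

def word_bigrams(text, min_len=0):
    """Return the set of pairs of consecutive words found in the whole text."""
    words = [w for w in re.findall(r"[a-z']+", text.lower()) if len(w) > min_len]
    return {f"{a} {b}" for a, b in zip(words, words[1:])}

# Tiny made-up data set, only to show the shapes of the results
toy_messages = [f"message number {i}" for i in range(20)]
toy_labels = ["spam" if i % 4 == 0 else "ham" for i in range(20)]

split = three_way_split(toy_messages, toy_labels, p_val=0.2, p_test=0.2)
sizes = {name: len(part[0]) for name, part in split.items()}
print(sizes)
print(sorted(word_bigrams("Subject: win a free prize now")))
```

With such a split, the validation part would be used to compare candidate settings (for instance different `k` and `drop_short` values of the `NaiveBayes` class), and the test part would be used only once, for the final score.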

In [1]:
from collections import defaultdict
import re
import numpy as np

class NaiveBayes:
    def __init__(self, k, drop_short):
        """
        k - pseudocount, usually 1
        drop_short - tokens of this length or shorter are dropped
        """
        self.k = k
        self.vocab = set()  # vocabulary, i.e., set of all seen tokens
        self.token_in_spam = defaultdict(int)   # counters of tokens in spam
        self.token_in_ham  = defaultdict(int)    # ... and in ham messages
        self.pcond_spam    = self.pcond_ham = None # conditional probabilities of tokens, will be computed after training
        self.spam_total    = self.ham_total = 0    # total number of spam and ham messages
        self.p_spam_total  = self.p_ham_total = None  # marginal probabilities of spam and ham messages
        self.re_token      = re.compile(r"[a-z']+")  # regex to extract tokens
        self.drop_short    = drop_short  # tokens of this length or shorter are dropped
        
    def _text2tokens(self, text):
        """Convert a text to a list of unique tokens.
        Only the first line of a message, the one with the subject, is used."""
        lines   = text.lower().splitlines()
        subject = lines[0] if lines else ''  # guard against empty messages
        # The text is already lowercased, so the prefix must be matched in lowercase
        subject = subject.replace('subject: ', '')
        all_tokens    = self.re_token.findall(subject)
        unique_tokens = list(set(all_tokens))
        good_tokens   = [tok for tok in unique_tokens if len(tok) > self.drop_short]
        return good_tokens
    
    def fit(self, messages, labels):
        """Training: computing the probabilities for each token 
        to be encountered in spam and ham messages.
        """
        
        # Count tokens in spam and in ham messages
        for mes, lab in zip(messages, labels):
            tokens = self._text2tokens(mes)
            if lab == 'spam':
                self.spam_total += 1
                for tok in tokens:
                    self.token_in_spam[tok] += 1
            else:
                self.ham_total += 1
                for tok in tokens:
                    self.token_in_ham[tok] += 1
            self.vocab.update(tokens)

        # Compute probabilities
        self.pcond_spam = defaultdict(int)
        self.pcond_ham = defaultdict(int)
        for tok in self.vocab:
            self.pcond_spam[tok] = (self.token_in_spam[tok] + self.k) / (self.spam_total + 2 * self.k)
            self.pcond_ham[tok]  = (self.token_in_ham[tok] + self.k) / (self.ham_total + 2 * self.k)
        self.p_spam_total = self.spam_total / (self.spam_total + self.ham_total)
        self.p_ham_total  = 1 - self.p_spam_total
        
    def predict(self, messages):
        """Prediction: computing labels for messages.
        """
        pred = []
        for mes in messages:
            message_tokens = set(self._text2tokens(mes))  # a set gives fast membership tests
            log_sum_spam   = np.log(self.p_spam_total)  # accumulate log probabilities for spam
            log_sum_ham    = np.log(self.p_ham_total)    # ... and ham messages
            for tok in self.vocab:
                p_spam = self.pcond_spam[tok]
                p_ham  = self.pcond_ham[tok]
                if tok not in message_tokens:  # if the token is absent from the message, take the complementary probabilities
                    p_spam = 1 - p_spam
                    p_ham  = 1 - p_ham
                log_sum_spam += np.log(p_spam)
                log_sum_ham  += np.log(p_ham)
            # Make a decision, spam or ham
            pred.append('spam' if log_sum_spam > log_sum_ham else 'ham')
        return pred
    
    def explore_vocab(self):
        """Make a prediction for every token separately to see
        how they influence the prediction.
        """
        spam_words = []
        for tok in self.vocab:
            p_spam = self.pcond_spam[tok] * self.p_spam_total
            p_ham = self.pcond_ham[tok] * self.p_ham_total
            if p_spam > p_ham:
                spam_words.append([tok, p_spam])
                
        spam_words = sorted(spam_words, key=lambda x: -x[1])
        words_only = [s[0] for s in spam_words]
        return words_only

In [2]:
import csv
import requests
from io import BytesIO, TextIOWrapper
from zipfile import ZipFile

def load_zipcsv_categorical(file_name):
    """Download a zipped CSV dataset from the repo and return it as a nested list."""
    base_url = "https://raw.githubusercontent.com/kupav/data-sc-intro/main/data/"
    web_data = requests.get(base_url + file_name)
    web_data.raise_for_status()  # raise an HTTPError if the download failed

    # unzip the content
    zf = ZipFile(BytesIO(web_data.content))
    
    # zipped file name
    zipped_name = zf.namelist()[0]
    print(f"Download {file_name}, unzip {zipped_name}")
    
    # Open unpacked file
    with zf.open(zipped_name, 'r') as file:
        # TextIOWrapper(file) converts byte strings to plain strings
        reader = csv.reader(TextIOWrapper(file), delimiter=',')
        data = list(reader)  # read all rows into a nested list
    return data

raw_data = load_zipcsv_categorical("spam_and_ham.zip")

Download spam_and_ham.zip, unzip spam_ham_dataset.csv


In [3]:
data_lab = [row[1] for row in raw_data[1:]]
data_mes = [row[2] for row in raw_data[1:]]

In [32]:
from sklearn.model_selection import train_test_split

p_test = 0.1
n_test = round(p_test * len(data_lab))

X_train, X_test, y_train, y_test = train_test_split(data_mes, data_lab, random_state=0, 
                                                    test_size=n_test, shuffle=True)

print(f"train size {len(y_train)}")
print(f" test size {len(y_test)}")

train size 4654
 test size 517


In [49]:
nbc = NaiveBayes(k=1, drop_short=2)
nbc.fit(X_train, y_train)

In [50]:
y_pred = nbc.predict(X_test)

In [51]:
from sklearn import metrics

acc = metrics.accuracy_score(y_test, y_pred)
f1 = metrics.f1_score(y_test, y_pred, average='binary', pos_label='spam')

print(f"Accuracy = {acc:.4f}")
print(f"F1-score = {f1:.4f}")

Accuracy = 0.8627
F1-score = 0.7149


In [52]:
prec, rec, f1, _ = metrics.precision_recall_fscore_support(y_test, y_pred, average='binary', pos_label='spam')

print(f"Precision = {prec:.4f}")
print(f"Recall    = {rec:.4f}")
print(f"F1-score  = {f1:.4f}")

Precision = 0.8900
Recall    = 0.5973
F1-score  = 0.7149


Initial scores:

| Metric    | Value  |
|-----------|--------|
| Accuracy  | 0.8627 |
| F1-score  | 0.7149 |
| Precision | 0.8900 |
| Recall    | 0.5973 |

In [None]:
# I could not improve the accuracy, so I will move on to another task.

5\. Previously we discussed that in most cases data must be standardized before a machine learning model is created. Why does this not influence the performance of a Gaussian naive Bayes classifier?

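The reason is not that outliers are filtered out by the algorithm. A Gaussian naive Bayes classifier fits a separate mean and variance to every feature within every class. Standardization shifts and rescales each feature on its own, $x \to (x - \mu) / \sigma$, so the fitted Gaussians are shifted and rescaled in exactly the same way. For a given feature, the likelihood of every class is multiplied by the same factor $\sigma$, and this factor cancels when the class posteriors are compared. The predicted labels therefore stay the same.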
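
The effect of standardization on a Gaussian naive Bayes classifier can be checked numerically. Below is a minimal sketch that uses only the standard library: a small hand-written classifier (the functions `fit_gnb`, `predict_gnb` and `standardize` are written for this illustration and are not a library API) is trained on synthetic data, once on raw features and once on standardized ones, and the two sets of predictions are compared.

```python
import math
import random
import statistics

def fit_gnb(X, y):
    """Estimate the prior and the per-feature mean and variance of each class."""
    model = {}
    for c in sorted(set(y)):
        rows = [x for x, lab in zip(X, y) if lab == c]
        cols = list(zip(*rows))
        model[c] = (len(rows) / len(X),
                    [statistics.fmean(col) for col in cols],
                    [statistics.pvariance(col) for col in cols])
    return model

def predict_gnb(model, X):
    """Pick the class with the largest log posterior for every row."""
    pred = []
    for x in X:
        scores = {}
        for c, (prior, means, variances) in model.items():
            s = math.log(prior)
            for v, m, var in zip(x, means, variances):
                # log of the Gaussian density of feature value v
                s += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
            scores[c] = s
        pred.append(max(scores, key=scores.get))
    return pred

def standardize(X_train, X_other):
    """Standardize both sets with the mean and std of the training set."""
    cols = list(zip(*X_train))
    mu = [statistics.fmean(col) for col in cols]
    sd = [statistics.pstdev(col) for col in cols]

    def scale(X):
        return [[(v - m) / s for v, m, s in zip(x, mu, sd)] for x in X]

    return scale(X_train), scale(X_other)

# Synthetic data: two classes, two features on very different scales
rng = random.Random(0)
X, y = [], []
for label, (m1, m2) in [("a", (0.0, 1000.0)), ("b", (1.5, 1400.0))]:
    for _ in range(200):
        X.append([rng.gauss(m1, 1.0), rng.gauss(m2, 300.0)])
        y.append(label)
X_train, y_train = X[::2], y[::2]   # even rows for training
X_test = X[1::2]                    # odd rows for testing

pred_raw = predict_gnb(fit_gnb(X_train, y_train), X_test)
X_train_s, X_test_s = standardize(X_train, X_test)
pred_std = predict_gnb(fit_gnb(X_train_s, y_train), X_test_s)
print(pred_raw == pred_std)
```

Library implementations may add a small smoothing term to the variances (scikit-learn's `GaussianNB` has a `var_smoothing` parameter for this), so with them the agreement is, strictly speaking, only approximate.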