# Homework 2

## 2.1
Below is the code for a spam filter that follows the approach described at http://www.paulgraham.com/spam.html. I built
the filter as a class that is initialized with a corpus of regular mail and a corpus of spam mail. On initialization,
it computes, for every token that appears in either corpus, the probability that a message containing that token is
spam. Those token probabilities are then combined to compute the probability that a new message passed to the filter
is spam.


In [1]:
class Filter:
    """A filter for spam mail. Takes a corpus of regular and spam mail and
    calculates the probability that a new message is spam based on the
    tokens found in each corpus.
    """

    def __init__(self, mail_corpus, spam_corpus):
        self.mail = mail_corpus
        self.spam = spam_corpus
        self.good = {}
        self.bad = {}
        self.probability = {}
        self.create_mail_hash(mail_corpus)
        self.create_spam_hash(spam_corpus)
        self.create_probability_hash()

    def create_mail_hash(self, mail_corpus):
        """Creates a dictionary of the number of occurrences of tokens from regular mail"""
        for message in mail_corpus:
            for word in message:
                wordVal = self.good.get(word, 0)
                self.good[word] = wordVal + 1

    def create_spam_hash(self, spam_corpus):
        """Creates a dictionary of the number of occurrences of tokens from spam mail"""
        for message in spam_corpus:
            for word in message:
                wordVal = self.bad.get(word, 0)
                self.bad[word] = wordVal + 1

    def create_probability_hash(self):
        """Creates a dictionary of the probability that a word is spam"""
        for word in self.good:
            self.probability[word] = self.compute_probability(word)

        for word in self.bad:
            if self.good.get(word, 0) == 0:
                self.probability[word] = self.compute_probability(word)

    def compute_probability(self, word):
        """Computes the probability that a word is spam using Bayes' rule.

        Returns 0 for a word that appears in neither corpus.
        """
        ngood = len(self.mail)
        nbad = len(self.spam)
        # Occurrences in regular mail are doubled to bias against false positives
        g = 2 * self.good.get(word, 0)
        b = self.bad.get(word, 0)
        if g + b >= 1:
            good_ratio = min(1, g / ngood)
            bad_ratio = min(1, b / nbad)
            # Clamp to [0.01, 0.99] so that no single word is treated as certain
            return max(0.01, min(0.99, bad_ratio / (good_ratio + bad_ratio)))
        return 0

    def filter(self, message):
        """Computes the probability that a new message is spam"""
        message_probs = []
        for word in message:
            word_prob = self.compute_probability(word)
            if word_prob == 0:
                # Words seen in neither corpus get the default of 0.4 from Graham's essay
                word_prob = 0.4
            message_probs.append(word_prob)

        prod = prod_comp = 1
        for prob in message_probs:
            prod *= prob
            prod_comp *= 1 - prob
        return prod / (prod + prod_comp)

    def get_probabilities(self):
        return self.probability
    

This filter takes a Bayesian approach to the spam model because it combines new data with existing data: adding more
messages to either corpus and rebuilding the probability table updates every token's probability. This way the filter
can adjust its probabilities as new spam information is received.
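
As a rough illustration of that updating behavior, the sketch below recomputes one token's probability after a single new spam message is added. The helper `token_spam_probability` is a hypothetical standalone restatement of `compute_probability`, not part of the `Filter` class, and the counts are made up for the example.

```python
def token_spam_probability(good_count, bad_count, ngood, nbad):
    """Standalone restatement of Filter.compute_probability (hypothetical helper)."""
    g = 2 * good_count  # occurrences in regular mail are counted twice
    b = bad_count
    if g + b < 1:
        return 0  # token seen in neither corpus
    good_ratio = min(1, g / ngood)
    bad_ratio = min(1, b / nbad)
    return max(0.01, min(0.99, bad_ratio / (good_ratio + bad_ratio)))


# A token seen once in 2 regular messages and once in 2 spam messages
before = token_spam_probability(good_count=1, bad_count=1, ngood=2, nbad=2)

# Assume one more spam message containing the token is added to the spam corpus
after = token_spam_probability(good_count=1, bad_count=2, ngood=2, nbad=3)

print(round(before, 4), round(after, 4))  # 0.3333 0.4
```

The extra spam occurrence moves the token from 0.33 toward 0.4, which is the kind of adjustment the paragraph above describes.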

In [2]:
spam_corpus = [["I", "am", "spam", "spam", "I", "am"], ["I", "do", "not", "like", "that", "spamiam"]]
ham_corpus = [["do", "i", "like", "green", "eggs", "and", "ham"], ["i", "do"]]
spam_filter = Filter(ham_corpus, spam_corpus)

# Print the table of probabilities
print(spam_filter.get_probabilities())

# Print the probability that each message is spam
print(spam_filter.filter(spam_corpus[0]))
print(spam_filter.filter(spam_corpus[1]))
print(spam_filter.filter(ham_corpus[0]))
print(spam_filter.filter(ham_corpus[1]))

{'do': 0.3333333333333333, 'i': 0.01, 'like': 0.3333333333333333, 'green': 0.01, 'eggs': 0.01, 'and': 0.01, 'ham': 0.01, 'I': 0.99, 'am': 0.99, 'spam': 0.99, 'not': 0.99, 'that': 0.99, 'spamiam': 0.99}
0.9999999999989378
0.9999999583591874
2.6288392819642677e-11
0.005025125628140704


These probabilities are very close to either extreme, which makes sense because these same messages were used to
initialize the filter.
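
To see why the combined values land so near 0 or 1, the combination step in `filter` can be run by hand on token probabilities from the table printed above. This is only a minimal sketch: `combine` is a hypothetical standalone copy of the product formula, not a method of the class.

```python
def combine(probs):
    """Combine per-token spam probabilities the same way Filter.filter does."""
    prod = prod_comp = 1.0
    for p in probs:
        prod *= p            # product of the spam probabilities
        prod_comp *= 1 - p   # product of their complements
    return prod / (prod + prod_comp)


# ham_corpus[1] is ["i", "do"], with token probabilities 0.01 and 1/3
print(combine([0.01, 1 / 3]))

# spam_corpus[0] has six tokens, each with probability 0.99
print(combine([0.99] * 6))
```

The first call reproduces the value of about 0.005 printed for `ham_corpus[1]`. Each token at 0.99 multiplies the odds of spam by 99 (and each token at 0.01 divides them by 99), so even a short message of clamped tokens is pushed almost all the way to an extreme.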

## 2.2

In [3]:
from probability import BayesNet, enumeration_ask

# Utility variables
T, F = True, False

grass = BayesNet([
    ('Cloudy', '', 0.5),
    ('Sprinkler', 'Cloudy', {T: 0.10, F: 0.50}),
    ('Rain', 'Cloudy', {T: 0.80, F: 0.20}),
    ('WetGrass', 'Sprinkler Rain', {(T, T): 0.99, (T, F): 0.9, (F, T): 0.9, (F, F): 0.0})
    ])

b. The full joint probability distribution has 2<sup>4</sup> = 16 entries, since there are 4 Boolean variables in the
distribution. Because the entries must sum to 1, 15 of them are independent.

c. The number of independent values in the Bayesian network = 1 + 2 + 2 + 4 = 9 (one for Cloudy, two each for Sprinkler
and Rain, and four for WetGrass), since the conditional independence encoded in the network allows fewer values to be
stored than for the full joint probability distribution.
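
The two counts can be checked with a few lines of Python. This is a sketch that assumes every variable is Boolean; the `parents` dictionary is written out by hand from the network definition above rather than read from the `BayesNet` object.

```python
# Parents of each Boolean variable in the network above
parents = {
    "Cloudy": [],
    "Sprinkler": ["Cloudy"],
    "Rain": ["Cloudy"],
    "WetGrass": ["Sprinkler", "Rain"],
}

# Full joint table: one entry per assignment of the 4 Boolean variables
joint_entries = 2 ** len(parents)

# Bayesian network: one P(X = true | parents) per combination of parent values
network_values = sum(2 ** len(p) for p in parents.values())

print(joint_entries, network_values)  # 16 9
```

Of the 16 joint entries, one is fixed by the requirement that they sum to 1, which leaves 15 free values against 9 for the network.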

$\begin{aligned}
    \textbf{P}(Cloudy)
    \end{aligned}$

Given in table: <0.5, 0.5>

In [4]:
print(enumeration_ask('Cloudy', dict(), grass).show_approx())

False: 0.5, True: 0.5


$\begin{aligned}
    \textbf{P}(Sprinkler | Cloudy)
    \end{aligned}$

Given in table: <0.1, 0.9>

In [5]:
print(enumeration_ask('Sprinkler', dict(Cloudy=T), grass).show_approx())

False: 0.9, True: 0.1


$\begin{aligned}
    \textbf{P}(Cloudy | Sprinkler \land \neg Rain)
        &= \alpha *\textbf{P}(C, S, \neg R) \\
        &= \alpha *\textbf{P}(C) * P(S|C) * P(\neg R|C) \\
        &= \alpha *\langle (0.5 * 0.1 * 0.2), (0.5 * 0.5 * 0.8) \rangle \\
        &= \alpha *\langle 0.01, 0.2 \rangle \\
        &= \langle 0.0476, 0.952 \rangle
    \end{aligned}$

In [6]:
print(enumeration_ask('Cloudy', dict(Sprinkler=T, Rain=F), grass).show_approx())

False: 0.952, True: 0.0476
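
The normalization constant α in the derivation above is just the reciprocal of the sum of the unnormalized values. A minimal sketch of that step, where `normalize` is a hypothetical helper and not part of the `probability` module:

```python
def normalize(values):
    """Scale non-negative values so they sum to 1 (the role of alpha)."""
    total = sum(values)
    return [v / total for v in values]


# Unnormalized <P(C, S, not R), P(not C, S, not R)> from the derivation above
unnormalized = [0.5 * 0.1 * 0.2, 0.5 * 0.5 * 0.8]
alpha = 1 / sum(unnormalized)

print(round(alpha, 4))                                  # 4.7619
print([round(v, 4) for v in normalize(unnormalized)])   # [0.0476, 0.9524]
```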


$\begin{aligned}
    \textbf{P}(WetGrass | Cloudy \land Sprinkler \land Rain)
        &= \alpha * \textbf{P}(WG, C, S, R) \\
        &= \alpha * \textbf{P}(WG|S \land R) * P(S|C) * P(R|C) * P(C) \\
        &= \alpha * \langle (0.99 * 0.1 * 0.8 * 0.5), (0.01 * 0.1 * 0.8 * 0.5) \rangle \\
        &= \alpha * \langle 0.0396, 0.0004 \rangle \\
        &= \langle0.99, 0.01\rangle
    \end{aligned}$

In [7]:
print(enumeration_ask('WetGrass', dict(Cloudy=T, Sprinkler=T, Rain=T), grass).show_approx())

False: 0.01, True: 0.99


$\begin{aligned}
    \textbf{P}(Cloudy | \neg WetGrass)
        &= \alpha * \sum_S\sum_R\textbf{P}(C,S,R,\neg WG) \\
        &= \alpha * \sum_S\sum_R\textbf{P}(C) * P(S|C) * P(R|C) * P(\neg WG|S,R) \\
        &= \alpha * \langle (P(C) * P(S|C) * P(R|C) * P(\neg WG|S,R) + P(C) * P(\neg S|C) * P(R|C) * P(\neg WG|\neg S,R) \\
        &+ P(C) * P(S|C) * P(\neg R|C) * P(\neg WG|S,\neg R) + P(C) * P(\neg S|C) * P(\neg R|C) * P(\neg WG|\neg S,\neg R)), \\
        &(P(\neg C) * P(S|\neg C) * P(R|\neg C) * P(\neg WG|S,R) + P(\neg C) * P(\neg S|\neg C) * P(R|\neg C) * P(\neg WG|\neg S,R) \\
        &+ P(\neg C) * P(S|\neg C) * P(\neg R|\neg C) * P(\neg WG|S,\neg R) + P(\neg C) * P(\neg S|\neg C) * P(\neg R|\neg C) * P(\neg WG|\neg S,\neg R)) \rangle \\
        &= \alpha * \langle (0.5 * 0.1 * 0.8 * 0.01 + 0.5 * 0.9 * 0.8 * 0.1 + 0.5 * 0.1 * 0.2 * 0.1 + 0.5 * 0.9 * 0.2 * 1.0), \\
        &(0.5 * 0.5 * 0.2 * 0.01 + 0.5 * 0.5 * 0.2 * 0.1 + 0.5 * 0.5 * 0.8 * 0.1 + 0.5 * 0.5 * 0.8 * 1.0) \rangle \\
        &= \alpha * \langle 0.1274, 0.2255\rangle \\
        &= \langle 0.361, 0.639\rangle \\
    \end{aligned}$
    

In [8]:
print(enumeration_ask('Cloudy', dict(WetGrass=F), grass).show_approx())

False: 0.639, True: 0.361
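
As a cross-check that does not depend on the `probability` module, the same queries can be answered by summing the full joint distribution directly. The sketch below is a brute-force enumeration under the assumption that the CPT values are copied correctly from the network definition; the names `joint` and `query` are hypothetical and not part of any library.

```python
from itertools import product

# CPT entries copied from the network definition: P(variable = True | parents)
P_CLOUDY = 0.5
P_SPRINKLER = {True: 0.10, False: 0.50}   # keyed by Cloudy
P_RAIN = {True: 0.80, False: 0.20}        # keyed by Cloudy
P_WET = {(True, True): 0.99, (True, False): 0.9,
         (False, True): 0.9, (False, False): 0.0}  # keyed by (Sprinkler, Rain)

NAMES = ("Cloudy", "Sprinkler", "Rain", "WetGrass")


def prob(p_true, value):
    """Return P(X = value) given P(X = True)."""
    return p_true if value else 1 - p_true


def joint(c, s, r, w):
    """P(Cloudy=c, Sprinkler=s, Rain=r, WetGrass=w) via the chain rule."""
    return (prob(P_CLOUDY, c) * prob(P_SPRINKLER[c], s)
            * prob(P_RAIN[c], r) * prob(P_WET[(s, r)], w))


def query(variable, evidence):
    """Return [P(variable=True | evidence), P(variable=False | evidence)]."""
    totals = {True: 0.0, False: 0.0}
    for values in product((True, False), repeat=4):
        world = dict(zip(NAMES, values))
        # Keep only the worlds that agree with the evidence
        if all(world[name] == val for name, val in evidence.items()):
            totals[world[variable]] += joint(*values)
    norm = totals[True] + totals[False]
    return [totals[True] / norm, totals[False] / norm]


print([round(p, 3) for p in query("Cloudy", {"WetGrass": False})])
# [0.361, 0.639]
print([round(p, 3) for p in query("Cloudy", {"Sprinkler": True, "Rain": False})])
# [0.048, 0.952]
```

Both results agree with the hand derivations and with `enumeration_ask`, listed here in True, False order.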
