# Homework 2

---------------

## Exercise 1: Spam Filter

Below is an implementation of Paul Graham's algorithm outlined in "A Plan for Spam." It takes lists of words from known spam and ham emails and computes, for each word, the probability that an email containing it is spam.

In [4]:
import collections

class SpamFilter:
    

    def __init__(self, spam, ham):
        self.spam = []
        self.ham = []
        
        # Fill local lists with the words from the two corpora
        for i in spam:
            for j in i:
                self.spam.append(j.lower())        
        for i in ham:
            for j in i:
                self.ham.append(j.lower())
                
        self.nbad = len(spam)
        self.ngood = len(ham)
                
        # Count how often each word occurs in each corpus
        self.spam_hash = collections.Counter(self.spam)
        self.ham_hash = collections.Counter(self.ham)
        

        
        self.both = self.spam_hash.copy()
        self.both.update(self.ham_hash)
        self.probs = self.both.copy()
        
        for i in self.both.keys():
            self.probs[i] = self.find_probabilities(i)

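    def find_probabilities(self, key):
        # Ham counts are doubled, which biases the filter toward ham
        g = 2 * self.ham_hash[key]
        b = self.spam_hash[key]
        
        if b + g > 1:
            spam_freq = min(1.0, b / self.nbad)
            ham_freq = min(1.0, g / self.ngood)
            # Clamp the result to the range [0.01, 0.99]
            return max(0.01, min(0.99, spam_freq / (ham_freq + spam_freq)))
        else:
            # Words seen too rarely are stored with probability 0
            return 0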
    
    def spam_filter(self, corpus):
        probs = []
        for i in corpus:
            if i.lower() in self.probs:
                probs.append(self.probs[i.lower()])
            else:
                probs.append(0.4)        
        prod = 1
        comp_prod = 1
        
        for i in probs:
            prod *= i
            comp_prod *= (1 - i)
        
        return prod / (prod + comp_prod)
        

The table of probabilities (shown below) is then used to compute the likelihood that a new email is spam. The probabilities of the email's words are multiplied together, and that product is normalized against the product of their complements. The sample corpora are from the problem definition. The first test email should come out as spam because it consists of the words "I am spam", and both "am" and "spam" are strong indicators of spam.

In [5]:
spam_corpus = [["I", "am", "spam", "spam", "I", "am"], ["I", "do", "not", "like", "that", "spamiam"]]
ham_corpus = [["do", "i", "like", "green", "eggs", "and", "ham"], ["i", "do"]]

spam_filter = SpamFilter(spam_corpus, ham_corpus)

print(spam_filter.probs)

Counter({'am': 0.99, 'spam': 0.99, 'i': 0.5, 'do': 0.3333333333333333, 'like': 0.3333333333333333, 'green': 0.01, 'eggs': 0.01, 'and': 0.01, 'ham': 0.01, 'not': 0, 'that': 0, 'spamiam': 0})
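To see where individual entries in this table come from, here is a minimal standalone sketch that mirrors the logic of `find_probabilities`. The function name `word_probability` and its parameters are introduced here for illustration only, and the counts are read off the sample corpora by hand (2 spam messages, 2 ham messages).

```python
def word_probability(spam_count, ham_count, nbad, ngood):
    """Spam probability of one word, mirroring find_probabilities."""
    g = 2 * ham_count  # ham occurrences count double
    b = spam_count
    if b + g <= 1:
        # Too rare to score; the class stores 0 in this case
        return 0
    spam_freq = min(1.0, b / nbad)
    ham_freq = min(1.0, g / ngood)
    # Clamp to [0.01, 0.99]
    return max(0.01, min(0.99, spam_freq / (ham_freq + spam_freq)))


print(word_probability(1, 2, 2, 2))  # 'do': once in spam, twice in ham
print(word_probability(2, 0, 2, 2))  # 'am': twice in spam, never in ham
print(word_probability(1, 0, 2, 2))  # 'not': seen only once overall
```

The three calls should reproduce the table entries for 'do', 'am', and 'not' respectively.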


In [6]:

test_mail_1 = ['I','am','spam']
test_mail_1_result = spam_filter.spam_filter(test_mail_1)
if test_mail_1_result > 0.9:
    print("Probability of spam is %10.5f, test mail 1 is spam!" % test_mail_1_result)
else:
    print("Probability of spam is %10.5f, test mail 1 is ham!" % test_mail_1_result)


Probability of spam is    0.99990, test mail 1 is spam!


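And it does. The next test is the email "ham i am", which mixes a strong ham indicator ("ham"), a neutral word ("i"), and a strong spam indicator ("am"). These cancel out, so the result falls well below the 0.9 threshold and the email is classified as ham.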

In [7]:
test_mail_2 = ['ham', 'i', 'am']
test_mail_2_result = spam_filter.spam_filter(test_mail_2)
if test_mail_2_result > 0.9:
    print("Probability of spam is %10.5f, test mail 2 is spam!" % test_mail_2_result)
else:
    print("Probability of spam is %10.5f, test mail 2 is ham!" % test_mail_2_result)


Probability of spam is    0.50000, test mail 2 is ham!


The next test runs the filter on words it has not seen before. Each unknown word is assigned a probability of 0.4, so new words lean toward ham with some uncertainty, and the email comes out as ham.

In [8]:
test_mail_3 = ['These','are','new','words']
test_mail_3_result = spam_filter.spam_filter(test_mail_3)
if test_mail_3_result > 0.9:
    print("Probability of spam is %10.5f, test mail 3 is spam!" % test_mail_3_result)
else:
    print("Probability of spam is %10.5f, test mail 3 is ham!" % test_mail_3_result)



Probability of spam is    0.16495, test mail 3 is ham!
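The combining step can be checked by hand with a small standalone sketch. The function name `combine` is an assumption made for this example; it repeats the arithmetic of `spam_filter` on a list of per-word probabilities and uses `math.prod`, which requires Python 3.8 or later.

```python
import math


def combine(probs):
    """Combine per-word spam probabilities into one score."""
    prod = math.prod(probs)                      # all words indicate spam
    comp_prod = math.prod(1 - p for p in probs)  # all words indicate ham
    return prod / (prod + comp_prod)


# Table values for 'ham', 'i', 'am' (test mail 2)
print(round(combine([0.01, 0.5, 0.99]), 5))
# Four unseen words at 0.4 each (test mail 3)
print(round(combine([0.4] * 4), 5))
```

The two printed values should agree with the results reported for test mails 2 and 3.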


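The last example mixes spam and ham words ("green spam and not ham"). The algorithm is biased toward ham, since ham counts are doubled when the word probabilities are computed. Here the result is exactly 0 because "not" appears only once in the corpora and therefore has a stored probability of 0, which zeroes the whole product, so the email comes out as ham.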

In [9]:
test_mail_4 = ['green','spam','and','not', 'ham']
test_mail_4_result = spam_filter.spam_filter(test_mail_4)
if test_mail_4_result > 0.9:
    print("Probability of spam is %10.5f, test mail 4 is spam!" % test_mail_4_result)
else:
    print("Probability of spam is %10.5f, test mail 4 is ham!" % test_mail_4_result)


Probability of spam is    0.00000, test mail 4 is ham!


### Why this spam filter is Bayesian

This algorithm is Bayesian first because it is probabilistic: it uses probabilities, estimated from word frequencies in known spam and ham, to decide whether an email is spam. It is a naive Bayesian algorithm because the word probabilities are treated as independent. Each word is considered separately, without regard to which words surround it, the context it appears in, or whether some words "cause" other words. It is also Bayesian in that it combines the probabilities of multiple words into one final answer, just as Bayesian networks combine many probabilities to find the desired probability.
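
The combining rule used in `spam_filter` can be written as a formula. This is a sketch under two assumptions: the words are independent, and spam and ham are equally likely beforehand. The symbols are introduced here for illustration: $S$ is the event that the email is spam, and $p_i$ is the stored probability for the $i$-th word.

$\begin{aligned}
    P(S \mid w_1, \ldots, w_n)
    &= \frac{\prod_{i=1}^{n} p_i}{\prod_{i=1}^{n} p_i + \prod_{i=1}^{n} (1 - p_i)}
\end{aligned}$

The numerator corresponds to `prod` in the code and the second term of the denominator to `comp_prod`.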

----------------------------------

## Exercise 2: Bayesian Networks

### a. Implementation of the Bayesian network shown in Figure 14.12a

In [10]:
from probability import BayesNet, enumeration_ask, elimination_ask, gibbs_ask

# Utility variables
T, F = True, False

# Sprinkler network from AIMA Fig. 14.12a, built with the AIMA code (probability.py)
grass = BayesNet([
    ('Cloudy', '', 0.5),
    ('Sprinkler', 'Cloudy', {T: 0.1, F: 0.50}),
    ('Rain', 'Cloudy', {T: 0.80, F: 0.20}),
    ('WetGrass', 'Sprinkler Rain', {(T, T): 0.99, (T, F): 0.90, (F, T): 0.90, (F, F): 0.00})
])

### b. Compute the number of independent values in the full joint probability distribution for this domain. Assume that no conditional independence relations are known to hold between these values.

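The full joint distribution over four Boolean variables has $2^4$ entries. Because the entries must sum to 1, one of them is determined by the others.

$\begin{aligned}
    \text{entries} &= 2^4 = 16 \\
    \text{independent values} &= 2^4 - 1 = 15
\end{aligned}$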
    
### c. Compute the number of independent values in the Bayesian network for this domain. Assume the conditional independence relations implied by the Bayes network.

$\begin{aligned}
    \text{values} &= 1 + 2 + 2 + 4 \\
        &= 9
\end{aligned}$
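
Both counts can be derived from the network structure alone. The sketch below assumes every variable is Boolean; the `parents` dictionary and the variable names are written out by hand for this example and are not part of the AIMA code.

```python
# Parents of each Boolean variable in the sprinkler network
parents = {
    'Cloudy': [],
    'Sprinkler': ['Cloudy'],
    'Rain': ['Cloudy'],
    'WetGrass': ['Sprinkler', 'Rain'],
}

# Full joint table: one entry per assignment of all variables
joint_entries = 2 ** len(parents)
# The entries must sum to 1, so one is determined by the rest
joint_independent = joint_entries - 1

# Network: one independent number per combination of parent values
network_values = sum(2 ** len(p) for p in parents.values())

print(joint_entries, joint_independent, network_values)
```

The last count adds one value for `Cloudy`, two each for `Sprinkler` and `Rain`, and four for `WetGrass`.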

### d. Compute probabilities for the following:


In [11]:
print("compute P(Cloudy)")
print(enumeration_ask('Cloudy', dict(), grass).show_approx())

compute P(Cloudy)
False: 0.5, True: 0.5


This value comes straight from the diagram.
$\begin{aligned}
    \mathbf{P}(Cloudy) = \langle 0.5, 0.5 \rangle
\end{aligned}$

In [12]:
print("\nP(Sprinkler | cloudy)")
print(enumeration_ask('Sprinkler', dict(Cloudy=T), grass).show_approx())


P(Sprinkler | cloudy)
False: 0.9, True: 0.1


This can also be taken directly from the diagram.
$\begin{aligned}
    \mathbf{P}(Sprinkler \mid cloudy) = \langle 0.1, 0.9 \rangle
\end{aligned}$

In [13]:
print("\nP(Cloudy| the sprinkler is running and it’s not raining)")
print(enumeration_ask('Cloudy', dict(Sprinkler=T, Rain=F), grass).show_approx())


P(Cloudy| the sprinkler is running and it’s not raining)
False: 0.952, True: 0.0476


$\begin{aligned}
    \mathbf{P}(Cloudy \mid s, \lnot r)
    &= \alpha \langle P(c, s, \lnot r), P(\lnot c, s, \lnot r) \rangle \\
    &= \alpha \langle P(c) \cdot P(s \mid c) \cdot P(\lnot r \mid c), P(\lnot c) \cdot P(s \mid \lnot c) \cdot P(\lnot r \mid \lnot c) \rangle \\
    &= \alpha \langle (0.50)(0.10)(0.20), (0.50)(0.50)(0.80) \rangle \\
    &= \alpha \langle 0.01, 0.2 \rangle \\
    &= \langle 0.048, 0.952 \rangle
\end{aligned}$
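
The normalization step (the factor $\alpha$) can be checked with a few lines of Python. This is a minimal sketch; the helper name `normalize` is an assumption for this example, and the probabilities are copied by hand from the network definition.

```python
def normalize(values):
    """Scale a list of non-negative numbers so they sum to 1."""
    total = sum(values)
    return [v / total for v in values]


# Unnormalized terms given Sprinkler = true and Rain = false
cloudy_true = 0.50 * 0.10 * 0.20   # P(c) P(s|c) P(not r|c)
cloudy_false = 0.50 * 0.50 * 0.80  # P(not c) P(s|not c) P(not r|not c)

posterior = normalize([cloudy_true, cloudy_false])
print([round(p, 4) for p in posterior])
```

The first entry is the probability that it is cloudy, the second that it is not.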


In [30]:
print("\nP(WetGrass | it’s cloudy, the sprinkler is running and it’s raining)")
print(enumeration_ask('WetGrass', dict(Cloudy=T, Sprinkler=T, Rain=T), grass).show_approx())


P(WetGrass | it’s cloudy, the sprinkler is running and it’s raining)
False: 0.01, True: 0.99


$\begin{aligned}
    \mathbf{P}(WetGrass \mid c, s, r)
    &= \alpha \langle P(w, c, s, r), P(\lnot w, c, s, r) \rangle \\
    &= \alpha \langle P(w \mid s, r) P(s \mid c) P(r \mid c) P(c), P(\lnot w \mid s, r) P(s \mid c) P(r \mid c) P(c) \rangle \\
    &= \alpha \langle (0.99)(0.10)(0.80)(0.50), (0.01)(0.10)(0.80)(0.50) \rangle \\
    &= \alpha \langle 0.0396, 0.0004 \rangle \\
    &= \langle 0.99, 0.01 \rangle
\end{aligned}$

In [31]:
print("\nP(Cloudy | the grass is not wet)")
print(enumeration_ask('Cloudy', dict(WetGrass=F), grass).show_approx())


P(Cloudy | the grass is not wet)
False: 0.639, True: 0.361


$\begin{aligned}
    \mathbf{P}(Cloudy \mid \lnot w)
    &= \alpha \sum_{s} \sum_{r} \mathbf{P}(Cloudy, \lnot w, s, r) \\
    &= \alpha \langle P(c)P(r|c)P(s|c)P(\lnot w|s,r) + P(c)P(\lnot r|c)P(s|c)P(\lnot w|s,\lnot r) \\
    &\qquad + P(c)P(r|c)P(\lnot s|c)P(\lnot w|\lnot s,r) + P(c)P(\lnot r|c)P(\lnot s|c)P(\lnot w|\lnot s,\lnot r), \\
    &\qquad P(\lnot c)P(r|\lnot c)P(s|\lnot c)P(\lnot w|s,r) + P(\lnot c)P(\lnot r|\lnot c)P(s|\lnot c)P(\lnot w|s,\lnot r) \\
    &\qquad + P(\lnot c)P(r|\lnot c)P(\lnot s|\lnot c)P(\lnot w|\lnot s,r) + P(\lnot c)P(\lnot r|\lnot c)P(\lnot s|\lnot c)P(\lnot w|\lnot s,\lnot r) \rangle \\
    &= \alpha \langle (0.50)(0.80)(0.10)(0.01) + (0.50)(0.20)(0.10)(0.10) \\
    &\qquad + (0.50)(0.80)(0.90)(0.10) + (0.50)(0.20)(0.90)(1.00), \\
    &\qquad (0.50)(0.20)(0.50)(0.01) + (0.50)(0.80)(0.50)(0.10) \\
    &\qquad + (0.50)(0.20)(0.50)(0.10) + (0.50)(0.80)(0.50)(1.00) \rangle \\
    &= \alpha \langle 0.1274, 0.2255 \rangle \\
    &= \langle 0.361, 0.639 \rangle
\end{aligned}$
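
The same sum over the hidden variables can be carried out by brute force without the AIMA library. The sketch below is an independent check, not the `enumeration_ask` implementation. All names in it (`P_CLOUDY`, `prob`, `joint`, `cloudy_given_not_wet`) are assumptions for this example, and the probability tables are copied by hand from the network definition.

```python
from itertools import product

P_CLOUDY = 0.5
P_SPRINKLER = {True: 0.10, False: 0.50}  # P(s | Cloudy)
P_RAIN = {True: 0.80, False: 0.20}       # P(r | Cloudy)
P_WET = {(True, True): 0.99, (True, False): 0.90,
         (False, True): 0.90, (False, False): 0.00}  # P(w | Sprinkler, Rain)


def prob(p_true, value):
    """Probability of a Boolean value, given the probability of True."""
    return p_true if value else 1 - p_true


def joint(c, s, r, w):
    """Joint probability of one full assignment, using the chain rule."""
    return (prob(P_CLOUDY, c) * prob(P_SPRINKLER[c], s)
            * prob(P_RAIN[c], r) * prob(P_WET[(s, r)], w))


def cloudy_given_not_wet():
    """Sum out Sprinkler and Rain with WetGrass = False, then normalize."""
    unnormalized = {}
    for c in (True, False):
        unnormalized[c] = sum(joint(c, s, r, False)
                              for s, r in product((True, False), repeat=2))
    total = sum(unnormalized.values())
    return {c: v / total for c, v in unnormalized.items()}


posterior = cloudy_given_not_wet()
print({c: round(p, 3) for c, p in posterior.items()})
```

The printed dictionary should agree with the `enumeration_ask` output for this query.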