# 1. Spam Filtering

In [1]:
import string

spam_corpus = [["I", "am", "spam", "spam", "I", "am"], ["I", "do", "not", "like", "that", "spamiam"]]
ham_corpus = [["do", "i", "like", "green", "eggs", "and", "ham"], ["i", "do"]]

THRESHOLD = 1
BASE_PROB = 0.5


class BayesSpam:

    def __init__(self, ham, spam):

        self.threshold = THRESHOLD
        self.unknown_probability_value = BASE_PROB
        self.ngood = len(ham)
        self.nbad = len(spam)

        self.ham_hash = self.hash_occurances(ham)
        self.spam_hash = self.hash_occurances(spam)

        # get a list of all words, from either corpus
        # Credit to https://stackoverflow.com/a/16902603
        self.token_list = set().union(*[self.ham_hash, self.spam_hash])

        # Create the combined score hashmap
        self.score_hash = {}

        for word in self.token_list:
            # Ternary operation to assign good value
            g = (2 * self.ham_hash[word] if word in self.ham_hash else self.unknown_probability_value)
            # Ternary operation to assign bad value
            b = (self.spam_hash[word] if word in self.spam_hash else self.unknown_probability_value)

            if g + b > self.threshold:
                self.score_hash[word] = max(0.01, min(0.99, min(1.0, b / self.nbad) /
                                                      (min(1.0, g / self.ngood) + min(1.0, b / self.nbad))))

        print("Scores:")
        print(self.score_hash)

    @staticmethod
    def hash_occurances(corpus):
        new_hash = {}
        for i in corpus:
            for v in i:
                if v in new_hash:
                    new_hash[v.upper()] += 1
                else:
                    new_hash[v.upper()] = 1
        return new_hash

    def filter_spam(self, mail):
        mail = mail.translate(str.maketrans('', '', string.punctuation))  # remove punctuation
        words = mail.upper().split()
        prod_list = 1
        compliment_list = 1

        for word in words:
            if word in self.score_hash:
                prob = self.score_hash[word]
            else:
                prob = self.unknown_probability_value

            prod_list *= prob
            compliment_list *= (1 - prob)

        return prod_list / (prod_list + compliment_list)
    
    def to_str(self, text):
        return '"' + text + "\"\nSpam Score: " + str(self.filter_spam(text)) + "\n============"
    
spam_filter = BayesSpam(ham_corpus, spam_corpus)
print("\n======TESTS======\n")
print(spam_filter.to_str("i"))
print(spam_filter.to_str("I am sam, sam I am, do you like green eggs and ham?"))
print(spam_filter.to_str("spamiam"))
print(spam_filter.to_str("I am spam, spam I am"))

Scores:
{'AND': 0.2, 'SPAMIAM': 0.6666666666666666, 'NOT': 0.6666666666666666, 'SPAM': 0.6666666666666666, 'I': 0.5, 'GREEN': 0.2, 'THAT': 0.6666666666666666, 'HAM': 0.2, 'LIKE': 0.3333333333333333, 'DO': 0.3333333333333333, 'EGGS': 0.2, 'AM': 0.6666666666666666}


"i"
Spam Score: 0.5
"I am sam, sam I am, do you like green eggs and ham?"
Spam Score: 0.0038910505836575863
"spamiam"
Spam Score: 0.6666666666666666
"I am spam, spam I am"
Spam Score: 0.9411764705882353


This approach to spam filtering is Bayesian because it is based on **statistics** rather than hard properties. As this [deep AI article states][ai-link], "probability is a subjective process that can change as new information is gathered, rather than a fixed value based upon frequency or propensity." 

The big benefit of this is that the model will adapt to what it is told is spam, and therefore can change with the nature of spam for a given user. Graham notes in the ["plan for spam" FAQ][faq-link] that spammers can't really defeat this *even if they know about it* and try to tune their emails to get through -- the filters are A) different for each user and B) would simply start filtering according to the new spam paradigms.

[ai-link]:https://deepai.org/machine-learning-glossary-and-terms/bayesian-statistics
[faq-link]:http://www.paulgraham.com/spamfaq.html

# 2. Bayesian Networks

## a.

In [6]:
from probability import BayesNet, enumeration_ask, elimination_ask, gibbs_ask

# Utility variables
T, F = True, False

wet_grass = BayesNet([
    ('Cloudy', '', 0.5),
    ('Sprinkler', 'Cloudy', {T: 0.1, F: 0.5}),
    ('Rain', 'Cloudy', {T: 0.8, F: 0.2}),
    ('Wet_Grass', 'Sprinkler Rain', {(T, T): 0.99, (T, F): 0.9, (F, T): 0.9, (F, F): 0.0})
])

2.i:	 False: 0.5, True: 0.5
2.ii:	 False: 0.9, True: 0.1
2.iii:	 False: 0.952, True: 0.0476
2.iv:	 False: 0.01, True: 0.99
2.v:	 False: 0.639, True: 0.361


## b.

\# of independent values for the full joint distribution = 2 * 2 * 2 * 2 = 2\*\*4 = **16**

* This number comes from the 4 variables, which have 2 states each.

## c.

\# of independent values in the Bayesian network = (Cloudy) + (Sprinkler) + (Rain) + (Wet Grass) = 1 + 2 + 2 + 4 = **9**

* This number is derived from the conditional probability tables in 14.12 (a)

## d.

First, the computed results:

In [9]:
print("2.d.i:\t\t", enumeration_ask('Cloudy', dict(), wet_grass).show_approx())
print("2.d.ii:\t\t", enumeration_ask('Sprinkler', dict(Cloudy=T), wet_grass).show_approx())
print("2.d.iii:\t",enumeration_ask('Cloudy', dict(Sprinkler=T, Rain=F), wet_grass).show_approx())
print("2.d.iv:\t\t",enumeration_ask('Wet_Grass', dict(Cloudy=T, Sprinkler=T, Rain=T), wet_grass).show_approx())
print("2.d.v:\t\t",enumeration_ask('Cloudy', dict(Wet_Grass=F), wet_grass).show_approx())

2.d.i:		 False: 0.5, True: 0.5
2.d.ii:		 False: 0.9, True: 0.1
2.d.iii:	 False: 0.952, True: 0.0476
2.d.iv:		 False: 0.01, True: 0.99
2.d.v:		 False: 0.639, True: 0.361


Now verified manually:

### i.
`P(Cloudy) = <0.5, 0.5>` 

Straightforward, from the table

### ii.
`P(Sprinkler | cloudy)` = `α<P(S|c), P(¬S|c)>` = `<0.1, 0.9>` 

Also fairly simple, just P(S) given C == t, then 1 minus that.

### iii.
`P(Cloudy | sprinkler ^ rain)   
        = α<P(C|s,¬r), P(¬C|s,¬r)> 
        = α<(0.5)(0.1)(0.2),(0.5)(0.5)(0.8)> 
        = α<0.01, 0.2> 
        = <0.0476, 0.952>`

Less simple now...

### iv.
`P(Wet Grass | cloudy ^ sprinkler ^ rain) 
        = αP(c)P(s)P(r)<P(W|s,r),P(¬W|s,r)
        = α(0.5)(0.1)(0.8)<0.99,0.1>
        = α<0.99, 0.1>
        = <0.99, 0.10>`
        
In the end, the alpha values and the P(x)s essentially cancelled out.

### v.
`P(Cloud | ¬Wet Grass)
        = α Σs(Σr( P(C) * P(s^r) * P(w | s^r) ) )`

#### Table of Probability Summations

|P(C) * P(S^R) | S=T | S=F |
| ------ | --- | --- |
| R = T  | (0.5)(0.08)(0.01) | (0.5)(0.02)(0.1) |
| R = F  | (0.5)(0.72)(0.1)  | (0.5)(0.18)(1.)  |


| P(¬C) * P(S^R) | S=T | S=F |
| ------ | --- | --- |
| R = T  | (0.5)(0.1)(0.01) | (0.5)(0.4)(0.1) |
| R = F  | (0.5)(0.1)(0.1)  | (0.5)(0.4)(1.)  |


`.       
        = α <0.127, 0.226>
        = <0.36, 0.64>`