License: Attribution 4.0 International (CC BY 4.0) 

# Report 01

Author: Daniel Bishop

In [2]:
from thinkbayes2 import Pmf

## Book Solutions

This is a modified version of the cookie bowl problem. Imagine two bowls of cookies, each with a different ratio of vanilla to chocolate chip cookies. After picking some cookies out of one bowl, without putting them back, what is the probability that we were picking from one bowl or the other?

First we create a Bowl class, which tracks how many vanilla and chocolate chip cookies are left in a bowl, and a Cookie class, which is our PMF over the two bowls.

In [76]:
class Bowl():
    def __init__(self, num_vanilla, num_chocolate):
        self.vanilla = num_vanilla
        self.chocolate = num_chocolate
        
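    def getPercentTotal(self, hypo):
        # Here hypo is the flavor that was drawn, not a bowl name.
        # Note the side effect: the drawn cookie is removed from the bowl,
        # so the next call sees the updated counts (no replacement).
        if (hypo == "vanilla"):
            mix = self.vanilla / (self.vanilla + self.chocolate)
            self.vanilla -= 1
        elif (hypo == "chocolate"):
            mix = self.chocolate / (self.vanilla + self.chocolate)
            self.chocolate -= 1
        else:
            mix = 0

        # A count that has gone negative means the flavor ran out earlier,
        # so the draw is impossible for this bowl.
        if (mix < 0):
            return 0
        else:
            return mix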

class Cookie(Pmf):
    def __init__(self, hypos):
        Pmf.__init__(self)
        for hypo in hypos:
            self.Set(hypo, 1)
        self.Normalize()
        self.mixes = {
            'Bowl 1':Bowl(75, 25),
            'Bowl 2':Bowl(50, 50),
        }
    
    def Update(self, data):
        for hypo in self.Values():
            like = self.Likelihood(data, hypo)
            self.Mult(hypo, like)
        self.Normalize()
    
    def Likelihood(self, data, hypo):
        return self.mixes[hypo].getPercentTotal(data)

We can now use our model to compute the probability that we have been picking from Bowl 1 or Bowl 2, based on which cookies we pick.

In [77]:
pmf = Cookie(['Bowl 1', 'Bowl 2'])
dataset = ["vanilla", "chocolate", "vanilla"]
for data in dataset:
    pmf.Update(data)
for hypo, prob in pmf.Items():
    print(hypo, prob)

Bowl 1 0.5311004784688996
Bowl 2 0.46889952153110054
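As a sanity check, the same posterior can be worked out by hand with exact fractions. This is only a sketch that does not use thinkbayes2; the helper name likelihood_vcv is made up for illustration, and the bowl contents are copied from the Cookie class above.

```python
from fractions import Fraction

# Bowl contents as (vanilla, chocolate), copied from the Cookie class above.
bowls = {"Bowl 1": (75, 25), "Bowl 2": (50, 50)}

def likelihood_vcv(vanilla, chocolate):
    """P(vanilla, then chocolate, then vanilla) drawing without replacement."""
    total = vanilla + chocolate
    return (Fraction(vanilla, total)
            * Fraction(chocolate, total - 1)
            * Fraction(vanilla - 1, total - 2))

# With equal priors, the posterior is just the normalized likelihood.
likes = {name: likelihood_vcv(v, c) for name, (v, c) in bowls.items()}
norm = sum(likes.values())
posterior = {name: like / norm for name, like in likes.items()}

for name, prob in posterior.items():
    print(name, prob, float(prob))
```

Bowl 1 comes out as exactly 111/209, which is the 0.531 printed above.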


Here you can see the effect of not replacing the cookie in the bowl it was taken from. For instance, picking 51 vanilla cookies rules out the possibility of having been using Bowl 2, since it only has 50 vanilla cookies in it.

In [78]:
pmf = Cookie(['Bowl 1', 'Bowl 2'])
dataset = ["vanilla" for x in range(51)]
for data in dataset:
    pmf.Update(data)
for hypo, prob in pmf.Items():
    print(hypo, prob)

Bowl 1 1.0
Bowl 2 0.0
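Because getPercentTotal removes a cookie every time it is called, each Cookie object uses up its bowls as it is updated, which is why a fresh one is created for every experiment. A possible alternative, sketched here under the assumption that we always score a whole sequence of draws at once, is a function that works on local copies of the counts. The name sequence_likelihood is hypothetical and not part of thinkbayes2.

```python
from fractions import Fraction

def sequence_likelihood(vanilla, chocolate, draws):
    """Probability of a sequence of draws without replacement.

    Works on local copies of the counts, so the caller's numbers
    are left untouched.
    """
    counts = {"vanilla": vanilla, "chocolate": chocolate}
    like = Fraction(1)
    for flavor in draws:
        total = counts["vanilla"] + counts["chocolate"]
        if total == 0 or counts[flavor] == 0:
            return Fraction(0)  # the bowl cannot produce this draw
        like *= Fraction(counts[flavor], total)
        counts[flavor] -= 1
    return like

draws = ["vanilla"] * 51
print(sequence_likelihood(50, 50, draws))      # 0: Bowl 2 runs out of vanilla
print(sequence_likelihood(75, 25, draws) > 0)  # True: Bowl 1 can still do it
```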


## Mailing List Problems

In the Zombieland problem, students are given the task of shooting zombies. There are 2 types of students, biased ones and unbiased ones. The biased ones have a 2/3 chance of hitting the zombie when shooting at it, while the unbiased ones only have a 1/2 chance. Given that a student of unknown bias shot twice at a zombie, hitting it once and missing it once, what is the probability that the student is biased?

This problem can be set up in a very similar way to the cookie problem discussed above.

In [79]:
class Shooter(Pmf):
    def __init__(self, hypos):
        Pmf.__init__(self)
        for hypo in hypos:
            self.Set(hypo, 1)
        self.Normalize()
        self.mixes = {
            'biased':dict(hit=2/3, miss=1/3),
            'unbiased':dict(hit=1/2, miss=1/2),
        }
    
    def Update(self, data):
        for hypo in self.Values():
            like = self.Likelihood(data, hypo)
            self.Mult(hypo, like)
        self.Normalize()
    
    def Likelihood(self, data, hypo):
        mix = self.mixes[hypo]
        like = mix[data]
        return like

In [80]:
pmf = Shooter(['biased', 'unbiased'])
dataset = ['hit', 'miss']
for data in dataset:
    pmf.Update(data)
for hypo, prob in pmf.Items():
    print(hypo, prob)

biased 0.47058823529411764
unbiased 0.5294117647058824


Since the data collected matches a 50/50 spread, Bayes's theorem tells us that there is a 53% chance that the shooter is unbiased, and so a 47% chance that the shooter is biased. In order to be more sure, we would have to collect more data.
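To see how more data would change the answer, the posterior can be written as a function of the number of hits and misses. This is a minimal sketch with exact fractions, assuming equal priors and independent shots as in the Shooter class; the function name posterior_biased is made up for illustration.

```python
from fractions import Fraction

def posterior_biased(hits, misses, prior_biased=Fraction(1, 2)):
    """P(biased | hits, misses) for the two shooter types above."""
    p_hit_biased = Fraction(2, 3)
    p_hit_unbiased = Fraction(1, 2)
    # Likelihood of the observed shots under each hypothesis.
    like_biased = p_hit_biased**hits * (1 - p_hit_biased)**misses
    like_unbiased = p_hit_unbiased**hits * (1 - p_hit_unbiased)**misses
    # Bayes's theorem: prior times likelihood, then normalize.
    num = prior_biased * like_biased
    return num / (num + (1 - prior_biased) * like_unbiased)

print(posterior_biased(1, 1))            # 8/17, the 0.47 printed above
print(float(posterior_biased(10, 10)))   # same 50/50 spread, ten times the data
```

With one hit and one miss the result is exactly 8/17. Keeping the same even spread over more shots pushes the probability of the biased hypothesis lower.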

## Original Problem

Most spam emails contain some words that are not commonly found in 'regular' mail. Often, some complex text analysis is used to determine the spamminess of the email, but I wanted to see how well Bayes's theorem could be used to filter mail.

Let us assume that there is some set of words or phrases that are more likely to appear in spam emails, such as "free money", "$$$", "f r e e", "save big money", or "stock alert". Each string here is associated with the probability that an email containing it is a spam email, based on prior analysis by humans or other computers.

In [86]:
spam_keywords = {
    "free money": .45,
    "$$$": .75,
    "f r e e": .81,
    "save big money": .42,
    "stock alert": .62
}

In [87]:
class Email(Pmf):
    def __init__(self, hypos, spam_words):
        Pmf.__init__(self)
        for hypo in hypos:
            self.Set(hypo, 1)
        self.Normalize()
        self.spam_keywords = spam_words
    
    def Update(self, data):
        for hypo in self.Values():
            like = self.Likelihood(data, hypo)
            self.Mult(hypo, like)
        self.Normalize()
    
    def Likelihood(self, data, hypo):
        like = self.spam_keywords[data]
        if (hypo == "spam"):
            return like
        else:
            return 1-like

By feeding the Bayes model with a set of keywords and the likelihood that an email containing each word is a spam email, we can then give the model the set of keywords that show up in an email and figure out the probability that said email is spam.

For example, an email containing the word '$$$' has a 75% chance of being a spam email since 75% of prior emails that contained the word were spam.

In [89]:
pmf = Email(['spam', 'not spam'], spam_keywords)
dataset = ['$$$']
for data in dataset:
    pmf.Update(data)
for hypo, prob in pmf.Items():
    print(hypo, prob)

not spam 0.25
spam 0.75
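The posterior equals the keyword's 75% score only because the Email class starts from an even prior over spam and not spam. The sketch below repeats the same update in plain Python with an adjustable prior. The function name spam_posterior and its prior parameter are assumptions added for illustration and are not in the notebook's code.

```python
def spam_posterior(keywords, scores, prior=0.5):
    """Posterior probability of spam after seeing each keyword once."""
    spam = prior
    not_spam = 1 - prior
    for word in keywords:
        # Same rule as Email.Likelihood: score for spam, 1 - score otherwise.
        spam *= scores[word]
        not_spam *= 1 - scores[word]
    return spam / (spam + not_spam)

scores = {"$$$": 0.75}
print(spam_posterior(["$$$"], scores))             # 0.75 with the even prior
print(spam_posterior(["$$$"], scores, prior=0.2))  # lower if spam is rarer
```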


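Once more keywords start appearing in the email, the probability that the email is spam skyrockets. This cell keeps updating the pmf from the previous cell, so the new keywords stack on top of the '$$$' evidence. The output shown below matches applying these two updates twice; a single pass starting from the 75% state gives a spam probability of about 95%.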

In [91]:
dataset = ['f r e e', 'stock alert']
for data in dataset:
    pmf.Update(data)
for hypo, prob in pmf.Items():
    print(hypo, prob)

not spam 0.006842547634414177
spam 0.993157452365586


However, there may also be some words which are highly unlikely to show up in a spam email, such as the name of your school. We add these words to the dictionary of spam keywords with a very low probability of being in a spam email.

In [92]:
spam_keywords["olin"] = .01
spam_keywords["cute animals"] = .05
spam_keywords["grade"] = .01

Thus an email about "Olin \$\$\$" is much less likely to be a spam email than one that is just about "\$\$\$".

In [93]:
pmf = Email(['spam', 'not spam'], spam_keywords)
dataset = ['olin', '$$$']
for data in dataset:
    pmf.Update(data)
for hypo, prob in pmf.Items():
    print(hypo, prob)

not spam 0.9705882352941176
spam 0.02941176470588235
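One way to see why "olin" outweighs "$$$" is to look at the update in log-odds form, where each keyword adds a fixed amount to a running total. This is a sketch of the same calculation under the even prior used above, not the notebook's actual code; the helper names log_odds and spam_probability are made up for illustration.

```python
import math

def log_odds(p):
    """Log of the odds p : (1 - p)."""
    return math.log(p / (1 - p))

def spam_probability(keywords, scores):
    """Spam posterior with an even prior, computed by summing log-odds."""
    total = sum(log_odds(scores[word]) for word in keywords)
    # Convert the summed log-odds back into a probability.
    return 1 / (1 + math.exp(-total))

scores = {"olin": 0.01, "$$$": 0.75}
for word, p in scores.items():
    print(word, round(log_odds(p), 3))
print(spam_probability(["olin", "$$$"], scores))
```

The contribution from "olin" is roughly -4.6 while "$$$" adds only about +1.1, so the total stays strongly negative and the email is classified as not spam.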
