This is a simple application of Bayes' Theorem to a spam filter classifier.
Bayes' Theorem is at the base of conditional probability and is defined as:

P(h|D) = P(D|h) * P(h) / P(D)

Where:

- P(h|D) is the posterior probability: what we are trying to estimate.
- P(D|h) is the likelihood: a conditional probability that can be estimated from data we can obtain from some process.
- P(h) is the prior probability: the probability we already know, which is updated into the posterior probability.
- P(D) is the evidence: the new piece of data that we take into consideration to update the posterior probability.
Note that the notations 'h' and 'D' could be anything, but in the context of machine learning they are usually chosen to indicate hypothesis and data.
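As a quick sanity check of the formula above, here is a minimal sketch of the posterior computation with made-up numbers (the prior, likelihood, and evidence values are invented for illustration only):

```python
def posterior(likelihood, prior, evidence):
    """Bayes' Theorem: P(h|D) = P(D|h) * P(h) / P(D)."""
    return likelihood * prior / evidence

# Toy values: P(D|h) = 0.7, P(h) = 0.4, P(D) = 0.5 (all assumed).
p_h_given_d = posterior(likelihood=0.7, prior=0.4, evidence=0.5)
print(p_h_given_d)  # 0.56
```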
For the spam filter classifier, Bayes' Theorem becomes:

P(spam|word) = P(word|spam) * P(spam) / P(word)
Here our hypothesis is the occurrence of a word in spams and hams (spam or ham), and the data is each word in a given email (word).
We are trying to find the probability of the hypothesis given the data (P(spam|word)) by multiplying the probability of the data given the hypothesis (P(word|spam)) by the probability of the hypothesis (P(spam)).
The probability of the data given the hypothesis (P(word|spam)) is the part we can 'train' with our dataset in the classifier, while the probability of the hypothesis (P(spam)) is the one we assume. We compute this for both cases, spam and ham, and compare the resulting probabilities to give a final classification for a new message.
Note that the denominator is ignored here. It would be the probability of a word being contained in an email regardless of it being spam or ham (P(word)). It is not taken into consideration because it is the same for both classes: it is just a normalization constant that does not depend on the hypothesis.
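The comparison described above can be sketched as follows. The word likelihoods P(word|spam) and P(word|ham) would normally be estimated from a labelled dataset; the numbers and the smoothing value for unseen words here are invented for illustration:

```python
# Assumed, made-up likelihoods estimated from a hypothetical dataset.
p_word_given_spam = {"free": 0.30, "meeting": 0.02}
p_word_given_ham  = {"free": 0.05, "meeting": 0.20}

p_spam = 0.5  # assumed prior P(spam)
p_ham = 0.5   # assumed prior P(ham)

def score(words, likelihoods, prior):
    """Unnormalised posterior: P(h) * product of P(word|h).

    The evidence P(word) is skipped, as in the text, because it is
    the same constant for both classes and cancels in the comparison.
    """
    s = prior
    for w in words:
        s *= likelihoods.get(w, 1e-6)  # tiny fallback for unseen words
    return s

email = ["free", "meeting"]
spam_score = score(email, p_word_given_spam, p_spam)
ham_score = score(email, p_word_given_ham, p_ham)
print("spam" if spam_score > ham_score else "ham")  # prints "ham"
```

Whichever class has the larger unnormalised score is chosen as the label, which is exactly the comparison the classifier performs.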
The sample dataset provided is from this Kaggle dataset. The classifier is very basic and can be improved greatly; it is meant to demonstrate how Bayes' Theorem is applicable to machine learning.
- numpy
- pandas
- sklearn
Install these using pip.

Run `python sample_code.py` to execute the code.