Simple Naive Bayes Spam Filter demonstrating the use of Bayes' Theorem in machine learning and simple probabilistic modeling.

FullBeardDev/BayesTheoremSpamFilter

Bayes Theorem Spam Filter

This is a simple application of Bayes' Theorem to a spam filter classifier.

Bayes' Theorem

Bayes' Theorem is a fundamental result of conditional probability and is defined as:

$$P(h \mid D) = \frac{P(D \mid h)\, P(h)}{P(D)}$$

Where:

  • $P(h \mid D)$ is the posterior probability: what we are trying to estimate.
  • $P(D \mid h)$ is the likelihood: a conditional probability that can be estimated from data we obtain from some process.
  • $P(h)$ is the prior probability: the probability we already know, which is updated into the posterior probability.
  • $P(D)$ is the evidence: the new piece of data that we take into consideration to update the posterior probability.

Note that the notations 'h' and 'D' could be anything but in the context of machine learning they are usually chosen to indicate hypothesis and Data.
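
As a quick numeric illustration of the theorem, the sketch below plugs in made-up probabilities. The function name `posterior` and all three input values are hypothetical, chosen only to show how the terms combine. The evidence P(D) is obtained here from the law of total probability.

```python
def posterior(likelihood, prior, evidence):
    """Bayes' Theorem: P(h | D) = P(D | h) * P(h) / P(D)."""
    return likelihood * prior / evidence

# Hypothetical numbers, chosen only for illustration.
p_h = 0.2                # P(h): prior probability of the hypothesis
p_d_given_h = 0.5        # P(D | h): likelihood of the data if h is true
p_d_given_not_h = 0.05   # P(D | not h): likelihood of the data if h is false

# P(D): evidence, computed with the law of total probability.
p_d = p_d_given_h * p_h + p_d_given_not_h * (1 - p_h)

p_h_given_d = posterior(p_d_given_h, p_h, p_d)
print(round(p_h_given_d, 4))  # 0.7143
```

With these inputs, observing the data raises the probability of the hypothesis from the prior of 0.2 to a posterior of roughly 0.71.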

For the spam filter classifier, Bayes' Theorem becomes:

$$P(\text{isSpam} \mid \text{word}) \propto P(\text{word} \mid \text{isSpam})\, P(\text{isSpam})$$

Here the hypothesis is whether a message is spam or ham (isSpam), and the data is each word in a given email (word). We estimate the probability of the hypothesis given the data, $P(\text{isSpam} \mid \text{word})$, by multiplying the probability of the data given the hypothesis, $P(\text{word} \mid \text{isSpam})$, by the probability of the hypothesis, $P(\text{isSpam})$. The likelihood $P(\text{word} \mid \text{isSpam})$ is the part we can 'train' from the dataset, while the prior $P(\text{isSpam})$ is a value we assume. We compute this product for both cases, spam and ham, and compare the resulting probabilities to give a final classification for a new message.

Note that the denominator is being ignored here. It would be the evidence, $P(\text{word})$: the probability that a word appears in an email regardless of whether the email is spam or ham. It is left out because it is just a normalization constant. It is the same for the spam and the ham case, so it does not change which of the two probabilities is larger.
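
The comparison described above can be sketched in a few lines of plain Python. This is a minimal illustration and not the code in `sample_code.py`: the toy messages, the function names (`fit`, `log_score`, `classify`), and the assumed priors of 0.5 are all hypothetical. It also makes two choices the text above does not mention: it adds Laplace smoothing so unseen words do not give a zero probability, and it sums log probabilities over all words in a message, which is the "naive" independence assumption.

```python
import math
from collections import Counter

# Toy training data, invented for illustration only: (text, is_spam).
train = [
    ("win free prize now", True),
    ("free cash win", True),
    ("meeting at noon", False),
    ("lunch at noon tomorrow", False),
    ("project meeting tomorrow", False),
]

def fit(messages):
    """Count how often each word occurs in spam (True) and ham (False)."""
    word_counts = {True: Counter(), False: Counter()}
    for text, is_spam in messages:
        word_counts[is_spam].update(text.lower().split())
    return word_counts

def log_score(text, is_spam, word_counts, prior, alpha=1.0):
    """log P(isSpam) + sum of log P(word | isSpam); the denominator is omitted."""
    vocab = set(word_counts[True]) | set(word_counts[False])
    total = sum(word_counts[is_spam].values())
    score = math.log(prior)
    for word in text.lower().split():
        # Laplace smoothing (an assumption of this sketch) avoids log(0).
        p = (word_counts[is_spam][word] + alpha) / (total + alpha * len(vocab))
        score += math.log(p)
    return score

def classify(text, word_counts, prior_spam=0.5):
    """Return True if the spam score is larger than the ham score."""
    spam = log_score(text, True, word_counts, prior_spam)
    ham = log_score(text, False, word_counts, 1.0 - prior_spam)
    return spam > ham

word_counts = fit(train)
print(classify("free prize", word_counts))        # True
print(classify("meeting tomorrow", word_counts))  # False
```

Because both scores would be divided by the same evidence term, skipping that division leaves the outcome of `spam > ham` unchanged.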

Overview

The sample dataset provided is from this Kaggle dataset. The classifier is very basic and can be improved greatly. It is meant to demonstrate how Bayes' Theorem applies to machine learning.

Dependencies

  • numpy
  • pandas
  • sklearn

Install these using pip. Note that sklearn is published on PyPI under the package name scikit-learn.

Usage

Type python sample_code.py to run the code.
