# Statistical Approach - The Benchmark

As a first step towards a better solution to the 'opinion spam' problem, and to establish a benchmark for ourselves, we built our first model with a traditional statistical modelling approach: Naive Bayes from the scikit-learn library.

## Naive Bayes?

Naive Bayes is a common method for producing a baseline against which other, experimental methods can be compared. It is based on Bayes' theorem with a 'naive' twist.

That is, it is Bayes' theorem plus one assumption: “The conditions are independent” (more precisely, independent of each other given the class).

### What is a condition?

In Bayes' theorem, we talk about the 'probability of A given B'. The condition is B:

$$ P(A \mid B) = \frac{P(B \mid A) \, P(A)}{P(B)} $$
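To make the formula concrete, here is a minimal sketch that evaluates it for one condition. The events and all three probabilities are made-up illustration values, not figures from our data: A is "the review is spam" and B is "the review contains the word 'amazing'". The helper name `bayes_posterior` is hypothetical, and P(B) is expanded with the law of total probability.

```python
def bayes_posterior(prior_a: float, p_b_given_a: float, p_b_given_not_a: float) -> float:
    """Return P(A | B) using Bayes' theorem."""
    # P(B) = P(B | A) P(A) + P(B | not A) P(not A)
    evidence = p_b_given_a * prior_a + p_b_given_not_a * (1 - prior_a)
    return p_b_given_a * prior_a / evidence


# Hypothetical numbers, chosen only for illustration:
# P(spam) = 0.2, P('amazing' | spam) = 0.6, P('amazing' | genuine) = 0.1
posterior = bayes_posterior(prior_a=0.2, p_b_given_a=0.6, p_b_given_not_a=0.1)
print(round(posterior, 3))
```

With these assumed numbers, seeing the word raises the probability of spam from the prior of 0.2 to about 0.6.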

Now, when we have multiple conditions, Bayes' theorem looks like this:

$$ P(A \mid x_1, \dots, x_n) = \frac{P(x_1, \dots, x_n \mid A) \, P(A)}{P(x_1, \dots, x_n)} $$

Here x<sub>1</sub>, …, x<sub>n</sub> denotes the intersection of the events x<sub>1</sub> through x<sub>n</sub>, that is, all of them occurring together.


We want to find the class A with the maximum probability. We could work through every possible combination of values of the x<sub>i</sub>, but the number of combinations is m<sup>n</sup>, where m is the number of possible values for a feature and n is the number of features ([more here](https://en.wikipedia.org/wiki/Bayesian_network#Example)).

Instead of using this exponential algorithm, we can change the complexity to linear time by making an assumption.
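The gap between exponential and linear growth is easy to underestimate, so here is a small sketch that counts both. It assumes every feature has the same number of values m, and the two function names are hypothetical helpers written for this illustration.

```python
from itertools import product


def count_joint_combinations(m: int, n: int) -> int:
    """Count every combination of n features with m values each by enumerating them."""
    return sum(1 for _ in product(range(m), repeat=n))


def count_per_feature_values(m: int, n: int) -> int:
    """Count the values visited when each feature is handled on its own."""
    return m * n


# Binary features (m = 2) with a growing number of features n
for n in (2, 5, 10):
    print(n, count_joint_combinations(2, n), count_per_feature_values(2, n))
```

With only 10 binary features there are already 1024 combinations to consider, against 20 per-feature values.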

### Naive Bayes assumes features are independent

By assuming our incoming features are independent of each other given the class, we can compute our formula by simply multiplying the individual feature probabilities together. We no longer need to worry about combinations, so our algorithm goes from exponential to linear.

$$ P(A \mid x_1, \dots, x_n) \propto P(A) \prod_{i=1}^n P(x_i \mid A) $$

Note also that our denominator P(x<sub>1</sub>, …, x<sub>n</sub>) is not needed: it is the same for every class A, so it does not change which class has the maximum probability. This is why the equation above uses ∝ and leaves the denominator out.
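The product formula can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the scikit-learn implementation we used: the toy data is made up, the function names `train`, `score` and `predict` are hypothetical, and add-one (Laplace) smoothing is an extra assumption added so that an unseen feature value does not force the whole product to zero.

```python
from collections import Counter, defaultdict


def train(samples, labels):
    """Estimate the counts behind P(A) and P(x_i | A)."""
    class_counts = Counter(labels)
    # feature_counts[(label, i)][value] = times feature i took `value` in class `label`
    feature_counts = defaultdict(Counter)
    values_per_feature = defaultdict(set)
    for features, label in zip(samples, labels):
        for i, value in enumerate(features):
            feature_counts[(label, i)][value] += 1
            values_per_feature[i].add(value)
    return class_counts, feature_counts, values_per_feature


def score(features, label, model):
    """Unnormalised posterior: P(A) * prod_i P(x_i | A), without the denominator."""
    class_counts, feature_counts, values_per_feature = model
    result = class_counts[label] / sum(class_counts.values())  # P(A)
    for i, value in enumerate(features):
        count = feature_counts[(label, i)][value]
        # Add-one smoothing (an assumption of this sketch)
        result *= (count + 1) / (class_counts[label] + len(values_per_feature[i]))
    return result


def predict(features, model):
    """Pick the class with the largest unnormalised posterior."""
    return max(model[0], key=lambda label: score(features, label, model))


# Made-up toy data: (contains "amazing", contains "refund")
samples = [(1, 0), (1, 0), (1, 1), (0, 1), (0, 0), (0, 1)]
labels = ["spam", "spam", "spam", "genuine", "genuine", "genuine"]
model = train(samples, labels)
prediction = predict((1, 0), model)
print(prediction)
```

On this toy data the scores for the review `(1, 0)` are 0.24 for "spam" and 0.04 for "genuine". They do not sum to 1 because the denominator was dropped, but the ranking of the classes is unchanged. Real implementations usually sum log probabilities instead of multiplying, to avoid numerical underflow when there are many features.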

The independence assumption is very unlikely to hold in practice, but Naive Bayes often performs well regardless. This is why it has become a common benchmark technique for data science experiments.