# Naive Bayes

**Naive Bayes** is a simple classifier known for doing well when only a small number of observations is available. In this tutorial we will create a Gaussian naive Bayes classifier from scratch and use it to predict the class of a previously unseen data point.

### Preliminaries

In [1]:
import pandas as pd
import numpy as np

## Create Data

Our dataset contains data on eight individuals. We will use it to construct a classifier that takes in the `height`, `weight`, and `foot size` of an individual and outputs a prediction for their `gender`.

In [2]:
# create an empty dataframe
data = pd.DataFrame()

# create our target variable
data['Gender'] = [
    'male',
    'male',
    'male',
    'male',
    'female',
    'female',
    'female',
    'female',]

# create our feature variables
data['Height'] = [6,5.92,5.58,5.92,5,5.5,5.42,5.75]
data['Weight'] = [180,190,170,165,100,150,130,150]
data['Foot_Size'] = [12,11,12,10,6,8,7,9]

In [3]:
# view the data
data

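|   | Gender | Height | Weight | Foot_Size |
|---|--------|--------|--------|-----------|
| 0 | male | 6.0 | 180 | 12 |
| 1 | male | 5.92 | 190 | 11 |
| 2 | male | 5.58 | 170 | 12 |
| 3 | male | 5.92 | 165 | 10 |
| 4 | female | 5.0 | 100 | 6 |
| 5 | female | 5.5 | 150 | 8 |
| 6 | female | 5.42 | 130 | 7 |
| 7 | female | 5.75 | 150 | 9 |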


The dataset above is used to construct our classifier. Below we create a new person whose feature values we know but whose gender we do not. Our goal is to predict their gender.

In [4]:
# create an empty dataframe
person = pd.DataFrame()

# create some feature values for this single row
person['Height'] = [6]
person['Weight'] = [130]
person['Foot_Size'] = [8]

# view the data
person

|   | Height | Weight | Foot_Size |
|---|--------|--------|-----------|
| 0 | 6 | 130 | 8 |


## Bayes Theorem

Bayes' theorem is a famous equation that allows us to make predictions based on data. Here is the classic version of Bayes' theorem.

$$ P(A | B) = \frac{P(A) P(B | A)}{P(B)}$$

This might be too abstract, so let us replace some of the variables to make it more concrete. In a Bayes classifier, we are interested in finding out the class (e.g. male or female, spam or ham) of an observation _given_ the data.

$$ P(class | data) = \frac{P(class) P(data | class)}{P(data)}$$

where:
- class is a particular class (e.g. male)
- data is an observation's data
- $P(class | data)$ is called the posterior
- $P(data | class)$ is called the likelihood
- $P(class)$ is called the prior
- $P(data)$ is called the marginal probability

Applying this theorem to our data:

$$ P(person \ is \ male \ |\ person's \ data) = \frac{P(person \ is \ male) \ P(person's \ data \ |\ person \ is \ male)}{P(person's \ data)}$$

## Gaussian Naive Bayes Classifier

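Gaussian naive Bayes is probably the most popular type of Bayes classifier. To explain what the name means, let us look at what the Bayes equations look like when we apply our two classes (male and female) and three feature variables (height, weight, and foot size). The classifier is "naive" because it assumes the features are independent of one another given the class, so the likelihood factors into one term per feature. It is "Gaussian" because each of those terms is modeled as a normal distribution: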

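$$ posterior(male) = \frac{P(male) \ p(height \ |\ male) \ p(weight \ |\ male) \ p(foot \ size \ |\ male)}{P(person's \ data)}$$

$$ posterior(female) = \frac{P(female) \ p(height \ |\ female) \ p(weight \ |\ female) \ p(foot \ size \ |\ female)}{P(person's \ data)}$$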
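
For reference, here is the form each likelihood term takes under the Gaussian assumption. The symbols $\mu_{class}$ and $\sigma^2_{class}$ are notation introduced here for illustration: they stand for the mean and variance of one feature among the observations of one class, and $x$ is the new observation's value for that feature.

$$ p(x \ |\ class) = \frac{1}{\sqrt{2 \pi \sigma^2_{class}}} \exp\left(-\frac{(x - \mu_{class})^2}{2 \sigma^2_{class}}\right)$$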

The marginal probability, $P(person's \ data)$, is probably one of the most confusing parts of Bayesian approaches. In fabricated examples such as this, it is possible to calculate the marginal probability. However, in many real-world cases, it is either extremely difficult or impossible to find the value of the marginal probability (and explaining why is beyond the scope of this tutorial).

This is not as much of a problem for our classifier as you might think. Why? Because we don't care what the true posterior value is; we only care which class has the highest posterior value. And because the marginal probability is the same for all classes, we can:

1. Ignore the denominator;
2. Calculate only the posterior's numerator for each class;
3. Pick the largest numerator.

That is, we can ignore the posterior's denominator and make a prediction solely on the relative values of the posterior's numerator.
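
A minimal sketch of why dropping the denominator is safe. The numerator values below are hypothetical, chosen only for illustration. The sketch also assumes the listed classes are the only possible ones, in which case the model's marginal probability is the sum of the numerators.

```python
# Hypothetical posterior numerators, P(class) * P(data | class), for two classes.
numerators = {"male": 0.002, "female": 0.006}

# Prediction from the numerators alone: pick the class with the largest value.
prediction = max(numerators, key=numerators.get)

# Assuming the classes are exhaustive, the marginal probability is the sum of
# the numerators, so dividing by it recovers normalized posteriors.
marginal = sum(numerators.values())
posteriors = {label: value / marginal for label, value in numerators.items()}

# Dividing every numerator by the same positive constant keeps the ranking.
assert max(posteriors, key=posteriors.get) == prediction

print(prediction)
print(posteriors)
```

Normalizing is only needed when the posterior probabilities themselves matter. For picking a class, the numerators are enough.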

## Calculate Priors

Priors can be either constants or probability distributions. In our example, the prior for each gender is simply the proportion of individuals in the dataset who have that gender.

Calculating this is simple:

In [5]:
# number of males
n_male = data['Gender'][data['Gender'] == 'male'].count()

# number of females
n_female = data['Gender'][data['Gender'] == 'female'].count()

# total rows
total_ppl = data['Gender'].count()

In [6]:
# prior for each class: number of rows in the class / total rows
P_male = n_male / total_ppl
P_female = n_female / total_ppl
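
As an aside, the same priors can be obtained in one step with pandas. This is a sketch of an alternative, not the approach used in the rest of the tutorial. The `gender` series simply recreates the `Gender` column from the dataset above.

```python
import pandas as pd

# Same target column as in the tutorial's dataset.
gender = pd.Series(['male'] * 4 + ['female'] * 4, name='Gender')

# normalize=True returns relative frequencies instead of raw counts.
priors = gender.value_counts(normalize=True)

print(priors['male'], priors['female'])
```

With four males and four females, both priors are 0.5.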

## Calculate Likelihood

Remember that each term in our likelihood is assumed to be a normal distribution. This means that for each class and feature combination we need to calculate the variance and mean value from the data. Pandas makes this easy.

In [7]:
# group the data by gender and calculate the means of each feature
data_means = data.groupby('Gender').mean()

# view the values
data_means

| Gender | Height | Weight | Foot_Size |
|--------|--------|--------|-----------|
| female | 5.4175 | 132.5 | 7.5 |
| male | 5.855 | 176.25 | 11.25 |


In [8]:
# group the data by gender and calculate the variance of each feature
data_variance = data.groupby('Gender').var()

# view the values
data_variance

| Gender | Height | Weight | Foot_Size |
|--------|--------|--------|-----------|
| female | 0.097225 | 558.333333 | 1.666667 |
| male | 0.035033 | 122.916667 | 0.916667 |


Now we can create all the variables we need. The code below might look complex, but all we are doing is creating a variable out of each cell in both of the tables above.

In [9]:
# means for male
male_height_mean = data_means['Height'][data_means.index == 'male'].values[0]
male_weight_mean = data_means['Weight'][data_means.index == 'male'].values[0]
male_footsize_mean = data_means['Foot_Size'][data_means.index == 'male'].values[0]

# variance for male
male_height_variance = data_variance['Height'][data_variance.index == 'male'].values[0]
male_weight_variance = data_variance['Weight'][data_variance.index == 'male'].values[0]
male_footsize_variance = data_variance['Foot_Size'][data_variance.index == 'male'].values[0]

# means for female
female_height_mean = data_means['Height'][data_means.index == 'female'].values[0]
female_weight_mean = data_means['Weight'][data_means.index == 'female'].values[0]
female_footsize_mean = data_means['Foot_Size'][data_means.index == 'female'].values[0]

# variance for female
female_height_variance = data_variance['Height'][data_variance.index == 'female'].values[0]
female_weight_variance = data_variance['Weight'][data_variance.index == 'female'].values[0]
female_footsize_variance = data_variance['Foot_Size'][data_variance.index == 'female'].values[0]
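
The per-cell lookups above can also be written with label-based indexing. The following is a sketch of that alternative: it rebuilds the dataset so that it runs on its own and pulls out two of the twelve values with `.loc`.

```python
import pandas as pd

# Rebuild the tutorial's dataset so this block is self-contained.
data = pd.DataFrame({
    'Gender': ['male'] * 4 + ['female'] * 4,
    'Height': [6, 5.92, 5.58, 5.92, 5, 5.5, 5.42, 5.75],
    'Weight': [180, 190, 170, 165, 100, 150, 130, 150],
    'Foot_Size': [12, 11, 12, 10, 6, 8, 7, 9],
})

data_means = data.groupby('Gender').mean()
data_variance = data.groupby('Gender').var()

# .loc[row_label, column_label] returns a single cell as a scalar.
male_height_mean = data_means.loc['male', 'Height']
female_weight_variance = data_variance.loc['female', 'Weight']

print(male_height_mean, female_weight_variance)
```

Keeping the two tables and indexing into them on demand avoids maintaining twelve separate variables, at the cost of slightly longer expressions later on.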

Finally, we need to create a function to calculate the probability density of each of the terms of the likelihood.

In [11]:
# create a function that calculates p(x | y):
def p_xy(x, mean_y, variance_y):

    # input the arguments into a probability density function
    p = 1/(np.sqrt(2*np.pi*variance_y)) * np.exp((-(x-mean_y)**2)/(2*variance_y))
    
    # return p
    return p
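
One way to sanity-check a hand-written density function is to compare it with a reference implementation. The sketch below redefines the function using only the standard library (so it runs on its own) and compares it with `statistics.NormalDist`, using the male height mean and variance from the tables above.

```python
import math
from statistics import NormalDist


def p_xy(x, mean_y, variance_y):
    """Normal density of x given a class mean and variance."""
    return (1 / math.sqrt(2 * math.pi * variance_y)
            * math.exp(-(x - mean_y) ** 2 / (2 * variance_y)))


# Male height statistics, copied from the tables above.
mean, variance = 5.855, 0.035033

ours = p_xy(6, mean, variance)

# NormalDist takes a standard deviation, not a variance.
reference = NormalDist(mu=mean, sigma=math.sqrt(variance)).pdf(6)

assert math.isclose(ours, reference, rel_tol=1e-9)
print(ours)
```

The value here is roughly 1.58. A result above 1 is not a bug: the function returns a probability density, not a probability, and densities can exceed 1 when the variance is small.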

## Apply Bayes Classifier to New Data Point

Our Bayes classifier is ready. Remember that since we can ignore the marginal probability (the denominator), what we are actually calculating is just the numerator of the posterior. To do this, we plug in the feature values of the unclassified person, the means and variances computed from the dataset, and the function we made above:

In [14]:
# numerator of the posterior if the unclassified observation is a male
P_male * p_xy(person['Height'][0], male_height_mean, male_height_variance) * \
    p_xy(person['Weight'][0], male_weight_mean, male_weight_variance) * \
    p_xy(person['Foot_Size'][0], male_footsize_mean, male_footsize_variance)

6.197e-09 (approximately)

In [15]:
# numerator of the posterior if the unclassified observation is a female
P_female * p_xy(person['Height'][0], female_height_mean, female_height_variance) * \
    p_xy(person['Weight'][0], female_weight_mean, female_weight_variance) * \
    p_xy(person['Foot_Size'][0], female_footsize_mean, female_footsize_variance)

5.378e-04 (approximately)

Because the numerator of the posterior for female is greater than the numerator for male, we predict that the person is female.
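
The whole procedure can also be condensed into one loop over the classes. The sketch below is one possible way to do that, not part of the original walkthrough. The helper name `log_numerators` is hypothetical. It uses all three features and works with logarithms, a common precaution because multiplying many small densities can underflow to zero.

```python
import numpy as np
import pandas as pd

# Rebuild the tutorial's dataset so this block is self-contained.
data = pd.DataFrame({
    'Gender': ['male'] * 4 + ['female'] * 4,
    'Height': [6, 5.92, 5.58, 5.92, 5, 5.5, 5.42, 5.75],
    'Weight': [180, 190, 170, 165, 100, 150, 130, 150],
    'Foot_Size': [12, 11, 12, 10, 6, 8, 7, 9],
})
person = {'Height': 6, 'Weight': 130, 'Foot_Size': 8}


def log_numerators(data, person, target='Gender'):
    """Return log(prior * likelihood) for every class (hypothetical helper)."""
    features = list(person)
    x = np.array([person[f] for f in features], dtype=float)
    scores = {}
    for label, group in data.groupby(target):
        mean = group[features].mean().to_numpy()
        var = group[features].var().to_numpy()  # sample variance, as in the tutorial
        log_prior = np.log(len(group) / len(data))
        # Sum of per-feature log densities equals the log of their product.
        log_likelihood = np.sum(
            -0.5 * np.log(2 * np.pi * var) - (x - mean) ** 2 / (2 * var)
        )
        scores[label] = log_prior + log_likelihood
    return scores


scores = log_numerators(data, person)
prediction = max(scores, key=scores.get)
print(prediction)
```

Because the logarithm is increasing, the class with the largest log numerator is also the class with the largest numerator, so the prediction is unchanged.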
