## Gaussian Naive Bayes Classifier

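Gaussian Naive Bayes models each feature $x_i$ within a class $y$ as normally distributed, with a mean $\mu_y$ and variance $\sigma_y^2$ estimated per class (and per feature) from the training data:

$f(x_i | y) = \frac{1}{\sqrt{2\pi{\sigma_y}^2}} 
  \exp\left( -\frac{\left(x_i-\mu_y\right)^2}{2{\sigma_y}^2}\right)$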
  

In [42]:
import pandas as pd
import numpy as np

# Create a dictionary of training data
data = {'Gender': ['male','male','male','male','female','female','female','female'],
        'Height': [6,5.92,5.58,5.92,5,5.5,5.42,5.75],
        'Weight': [180,190,170,165,100,150,130,150],
        'Foot_Size': [12,11,12,10,6,8,7,9]
       }
# Create a DataFrame from the dictionary
df = pd.DataFrame(data)
df

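| | Gender | Height | Weight | Foot_Size |
|---|---|---|---|---|
| 0 | male | 6.0 | 180 | 12 |
| 1 | male | 5.92 | 190 | 11 |
| 2 | male | 5.58 | 170 | 12 |
| 3 | male | 5.92 | 165 | 10 |
| 4 | female | 5.0 | 100 | 6 |
| 5 | female | 5.5 | 150 | 8 |
| 6 | female | 5.42 | 130 | 7 |
| 7 | female | 5.75 | 150 | 9 |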


### Priors

In [25]:
# Number of males
n_male = df['Gender'][df['Gender'] == 'male'].count()
# Number of females
n_female = df['Gender'][df['Gender'] == 'female'].count()
# Total rows
total_ppl = df['Gender'].count()
# Number of males divided by the total rows
P_male = n_male/total_ppl
# Number of females divided by the total rows
P_female = n_female/total_ppl
print(f"total count:{total_ppl}, {n_male} male ({P_male * 100}%) and {n_female} female ({P_female * 100}%)")

total count:8, 4 male (50.0%) and 4 female (50.0%)


### Likelihood

In [29]:
# Group the data by gender and calculate the means of each feature
data_means = df.groupby('Gender').mean()
print(data_means.head(5))

# Group the data by gender and calculate the variance of each feature
data_variance = df.groupby('Gender').var()
print(data_variance.head(5))


        Height  Weight  Foot_Size
Gender                           
female  5.4175  132.50       7.50
male    5.8550  176.25      11.25
          Height      Weight  Foot_Size
Gender                                 
female  0.097225  558.333333   1.666667
male    0.035033  122.916667   0.916667
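
Note that pandas' `var()` defaults to the sample variance (`ddof=1`, dividing by n - 1). As a quick cross-check, the sketch below uses only the standard library's `statistics` module to recompute the male `Height` mean and variance shown above. The list name `male_heights` is introduced here purely for illustration.

```python
import statistics

# Male heights copied from the training data above
male_heights = [6, 5.92, 5.58, 5.92]

mean_height = statistics.mean(male_heights)
# statistics.variance divides by n - 1 (sample variance), like pandas' default
sample_var = statistics.variance(male_heights)
# statistics.pvariance divides by n (population variance)
population_var = statistics.pvariance(male_heights)

print(round(mean_height, 4), round(sample_var, 6), round(population_var, 6))
```

The sample variance should agree with the `male` / `Height` entry of the variance table, while the population variance is smaller. With only four samples per class the choice of divisor changes the likelihoods noticeably.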


### Formula

posterior(male) = P(male) * P(height | male) * P(weight | male) * P(foot size | male) / evidence <br>
posterior(female) = P(female) * P(height | female) * P(weight | female) * P(foot size | female) / evidence <br><br>
evidence = P(male) * P(height | male) * P(weight | male) * P(foot size | male) + P(female) * P(height | female) * P(weight | female) * P(foot size | female) <br><br>
The evidence is the same positive constant for both classes (it is a sum of products of positive priors and normal densities, which are always positive), so it does not change which posterior is larger and can be ignored when classifying.<br>
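
To see what dividing by the evidence does, here is a minimal sketch in plain Python (standard library only, rather than the NumPy/pandas code used in this notebook). The helper names `gaussian_pdf` and `numerator` and the hard-coded dictionaries are illustrative assumptions: the means and variances are copied from the printed tables above, and the sample point (height 6, weight 200, foot size 11) is the one classified later in this notebook.

```python
import math

def gaussian_pdf(x, mean, variance):
    """Normal density at x for the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * variance)) / math.sqrt(2 * math.pi * variance)

# (mean, variance) per feature, copied from the groupby output above
stats = {
    "male": {"Height": (5.8550, 0.035033), "Weight": (176.25, 122.916667), "Foot_Size": (11.25, 0.916667)},
    "female": {"Height": (5.4175, 0.097225), "Weight": (132.50, 558.333333), "Foot_Size": (7.50, 1.666667)},
}
priors = {"male": 0.5, "female": 0.5}
person = {"Height": 6, "Weight": 200, "Foot_Size": 11}

def numerator(label):
    """Prior times the product of the per-feature likelihoods."""
    value = priors[label]
    for feature, x in person.items():
        mean, variance = stats[label][feature]
        value *= gaussian_pdf(x, mean, variance)
    return value

numerators = {label: numerator(label) for label in priors}
evidence = sum(numerators.values())
# Dividing by the evidence turns the numerators into probabilities that sum to 1
posteriors = {label: n / evidence for label, n in numerators.items()}
print(posteriors)
```

Dividing by the evidence rescales the numerators so that they sum to 1 but leaves their order unchanged, so the predicted class is the same either way.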

$f(x) = \frac{1}{\sigma\sqrt{2\pi}} 
  \exp\left( -\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{\!2}\,\right)$
  
  $\pi \approx 3.14159, \quad e \approx 2.71828$

In [30]:
# Means for male
male_height_mean = data_means.loc['male', 'Height']
male_weight_mean = data_means.loc['male', 'Weight']
male_footsize_mean = data_means.loc['male', 'Foot_Size']

# Variance for male
male_height_variance = data_variance.loc['male', 'Height']
male_weight_variance = data_variance.loc['male', 'Weight']
male_footsize_variance = data_variance.loc['male', 'Foot_Size']

# Means for female
female_height_mean = data_means.loc['female', 'Height']
female_weight_mean = data_means.loc['female', 'Weight']
female_footsize_mean = data_means.loc['female', 'Foot_Size']

# Variance for female
female_height_variance = data_variance.loc['female', 'Height']
female_weight_variance = data_variance.loc['female', 'Weight']
female_footsize_variance = data_variance.loc['female', 'Foot_Size']

In [32]:
# Create a function that calculates the Gaussian density p(x | y)
def p_x_given_y(x, mean_y, variance_y):
    # Evaluate the normal probability density function at x
    p = 1/(np.sqrt(2*np.pi*variance_y)) * np.exp((-(x-mean_y)**2)/(2*variance_y))
    return p
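
Keep in mind that this function returns a probability density, not a probability, so values above 1 are possible when the variance is small. The sketch below illustrates this with the standard library's `math` module and a hypothetical stand-in, `gaussian_pdf`, that mirrors the formula above. The male height statistics are copied from the earlier output.

```python
import math

def gaussian_pdf(x, mean, variance):
    # Same formula as p_x_given_y, written with math instead of NumPy
    return math.exp(-((x - mean) ** 2) / (2 * variance)) / math.sqrt(2 * math.pi * variance)

# Male height statistics copied from the groupby output
mean_height, var_height = 5.855, 0.035033

density_at_6 = gaussian_pdf(6, mean_height, var_height)
peak_density = gaussian_pdf(mean_height, mean_height, var_height)

print(density_at_6 > 1)             # a density may exceed 1
print(density_at_6 < peak_density)  # the density is largest at the mean
```

This is harmless for classification, because only the relative size of the two class scores matters.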

In [36]:
# Create an unlabeled data point (gender unknown) with feature values for a single row
person = {'Height': 6,'Weight': 200, 'Foot_Size': 11} # myself
df_new = pd.DataFrame(person, index=[0])
df_new

| | Height | Weight | Foot_Size |
|---|---|---|---|
| 0 | 6 | 200 | 11 |


### Posterior 

In [39]:
# Numerator of the posterior if the unclassified observation is a male
mp = P_male * \
p_x_given_y(df_new['Height'][0], male_height_mean, male_height_variance) * \
p_x_given_y(df_new['Weight'][0], male_weight_mean, male_weight_variance) * \
p_x_given_y(df_new['Foot_Size'][0], male_footsize_mean, male_footsize_variance)

In [40]:
# Numerator of the posterior if the unclassified observation is a female
fp=P_female * \
p_x_given_y(df_new['Height'][0], female_height_mean, female_height_variance) * \
p_x_given_y(df_new['Weight'][0], female_weight_mean, female_weight_variance) * \
p_x_given_y(df_new['Foot_Size'][0], female_footsize_mean, female_footsize_variance)

In [41]:
if mp>fp:
    print('male')
else:
    print('female')

male
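
Multiplying many small densities can underflow to 0.0 once there are many features. A common alternative, which is not what the notebook above does, is to compare sums of logarithms instead. Because the logarithm is monotonic, the larger log score picks the same class. In this minimal sketch the helper names `log_gaussian_pdf` and `log_score` and the hard-coded statistics (copied from the printed tables) are illustrative assumptions.

```python
import math

def log_gaussian_pdf(x, mean, variance):
    """Log of the normal density, computed without exponentiating."""
    return -0.5 * math.log(2 * math.pi * variance) - ((x - mean) ** 2) / (2 * variance)

# (mean, variance) per feature, copied from the groupby output
stats = {
    "male": {"Height": (5.8550, 0.035033), "Weight": (176.25, 122.916667), "Foot_Size": (11.25, 0.916667)},
    "female": {"Height": (5.4175, 0.097225), "Weight": (132.50, 558.333333), "Foot_Size": (7.50, 1.666667)},
}
priors = {"male": 0.5, "female": 0.5}
person = {"Height": 6, "Weight": 200, "Foot_Size": 11}

def log_score(label):
    """Log prior plus the sum of per-feature log likelihoods."""
    total = math.log(priors[label])
    for feature, x in person.items():
        mean, variance = stats[label][feature]
        total += log_gaussian_pdf(x, mean, variance)
    return total

log_scores = {label: log_score(label) for label in priors}
prediction = max(log_scores, key=log_scores.get)
print(prediction)
```

For this data point the log-space comparison should select the same class as the direct product.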
