# Basic Naive Bayes

## Theory

Naive Bayes classifiers estimate the probability of each class given an observed record.  If we represent the predicted class as a random variable $Y$ and the set of features as a random variable $X$, then naive Bayes predicts a class for data $x_i$ by selecting the class $y_i$ that has the highest value of $P(Y=y_i|X=x_i)$; in statistical nomenclature, this is the class with the highest posterior probability.  Training therefore consists of working out how to estimate these posterior probabilities from a set of training data.

To begin, let us assume that we are trying to predict the chance of borrower default given some historical data:

| married | diet | defaulted |
| --- | ---| ---|
| yes | conventional | no |
| no | conventional | yes |
| no | vegan | yes |
| yes | vegetarian | no |
| no | gluten-free | no |
| yes | West Coast | yes |


Given this data, what is the probability that a married vegan will default? 

We can attempt to solve this with the standard definition of conditional probability, $$P(default | (married \cap vegan)) = \frac{P(default \cap married \cap vegan)}{P(married \cap vegan)} $$ but find that this leaves us trying to estimate the joint probability of our features.  We likely didn't gather data with this estimate in mind (we are attempting to correlate attributes with our predicted class, not with other attributes).  In this data set, for example, no record is both married and vegan, so the empirical estimate of the denominator is 0.  Instead, we would like to estimate $P(default | (married \cap vegan))$ based on the relationships between our attributes and our predicted class.  We can accomplish this using Bayes' formula, which is the consequence of the following equalities:
$$ P(Y|X) = \frac{P(Y \cap X)}{P(X)} $$
$$ P(X|Y) = \frac{P(X \cap Y)}{P(Y)} $$
$$P(X \cap Y) = P(Y \cap X)$$

Solving the first two for their joint probabilities, equating them, and then dividing by $P(X)$ yields Bayes' formula:

$$ P(Y | X) = \frac{P(X|Y)P(Y)}{P(X)} $$
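
As a quick sanity check, the identity can be verified numerically on the borrower table using a single attribute.  The sketch below is only an illustration (the names `records` and `prob` are not part of the notebook's implementation); it computes $P(defaulted=yes | married=yes)$ once from the definition of conditional probability and once via Bayes' formula.

```python
from fractions import Fraction

# (married, defaulted) pairs copied from the borrower table above.
records = [("yes", "no"), ("no", "yes"), ("no", "yes"),
           ("yes", "no"), ("no", "no"), ("yes", "yes")]
n = len(records)


def prob(pred):
    """Empirical probability of the records that satisfy pred."""
    return Fraction(sum(1 for r in records if pred(r)), n)


p_x = prob(lambda r: r[0] == "yes")                      # P(married=yes)
p_y = prob(lambda r: r[1] == "yes")                      # P(defaulted=yes)
p_xy = prob(lambda r: r[0] == "yes" and r[1] == "yes")   # joint probability

direct = p_xy / p_x                  # definition of conditional probability
p_x_given_y = p_xy / p_y             # P(married=yes | defaulted=yes)
via_bayes = p_x_given_y * p_y / p_x  # Bayes' formula

print(direct, via_bayes)
```

Exact fractions are used so that the two routes can be compared without floating-point noise.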

As noted above, we can classify data $x_i$ by selecting the class $y_i$ that has the highest value of $P(Y=y_i|X=x_i)$.  Since $P(X)$ is constant across classes, we can ignore it when searching for the highest posterior.  Estimating $P(Y=y_i)$ from the training data is easy, as it is just the proportion of labels that belong to a particular class.  Estimating $P(X | Y)$ is more difficult, since $X$ is a vector of attributes and therefore $P(X=x_i)$ is shorthand for $P(x_0 = x_{i0} \cap x_1 = x_{i1} \cap ... \cap x_m = x_{im})$.  With __naive Bayes__, this calculation is greatly simplified by assuming that the attributes are conditionally independent given the class.  This assumption yields:

$$P((x_0 = x_{i0} \cap x_1 = x_{i1} \cap ... \cap x_m = x_{im}) | Y=y) = \prod_{j=0}^{m}{P(x_j=x_{ij} | Y=y)}$$

Consequently, we can estimate our posterior up to a constant factor (dropping the denominator, which is the same for every class) as:

$$P(Y=y | X=x_i) \propto \prod_{j=0}^{m}{P(x_j=x_{ij} | Y=y)}*P(Y=y)$$

### Training

The learned 'model' is a set of prior probabilities (i.e. $P(Y=y)$ for all classes $y$) and a set of conditional probabilities (i.e. $P(x_j=x_{ij} | Y=y)$).  We learn the priors from the class proportions in the training set and estimate each conditional probability from the records belonging to the corresponding class.

### Classifying

To classify, we calculate the posterior (i.e. $P(Y=y | X=x_i)$) for every class y using our learned priors and conditional probabilities.  Our label is the class with the highest posterior.
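
Written as a single decision rule (a restatement of the step above, with $\hat{y}$ introduced here only as notation for the predicted label):

$$\hat{y} = \arg\max_{y} \; P(Y=y)\prod_{j=0}^{m}{P(x_j=x_{ij} | Y=y)}$$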

### Misc

Arguably the most useful characteristic of a naive Bayes classifier is that it returns a probability for each class (after normalizing the class scores so they sum to 1), which can be used to gauge classification confidence.
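
Because the denominator $P(X)$ is dropped during classification, the per-class products are scores rather than probabilities until they are rescaled.  A minimal sketch of that rescaling, assuming the scores are held in a plain dictionary (the helper `normalize` and the example numbers are hypothetical):

```python
def normalize(scores):
    """Rescale unnormalized class scores so that they sum to 1."""
    total = sum(scores.values())
    if total == 0:
        raise ValueError("all class scores are zero; cannot normalize")
    return {label: s / total for label, s in scores.items()}


# Hypothetical unnormalized scores for two classes.
scores = {"yes": 0.02, "no": 0.06}
probs = normalize(scores)
print(probs)
```

The ranking of the classes is unchanged by this step, so it only matters when the values themselves are reported as confidences.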

### Exercise

Given our borrower default data, calculate the naive bayes 'model' and use it to predict if a married vegan will default on their loan.

---
Our model:

| model param | value |
|---|---|
| $P(defaulted=yes)$ | 0.5|
| $P(defaulted=no)$ | 0.5|
| $P(married=yes \mid defaulted=yes)$ | 0.333|
| $P(married=no \mid defaulted=yes)$ | 0.666|
| $P(married=yes \mid defaulted=no)$ | 0.666|
| $P(married=no \mid defaulted=no)$ | 0.333|
| $P(diet=conventional \mid defaulted=yes)$ | 0.333|
| $P(diet=conventional \mid defaulted=no)$ | 0.333|
| ... | ... |
| $P(diet=West Coast \mid defaulted=yes)$ | 0.333|
| $P(diet=West Coast \mid defaulted=no)$ | 0|

Our posteriors:


$\begin{align}
 P(defaulted=yes|(married=yes \cap diet=vegan)) & \propto P(married=yes | defaulted = yes) * P(diet=vegan | defaulted = yes) * P(defaulted=yes)\\
 & = 0.333 * 0.333 * 0.5 \\
 & \approx 0.055
\end{align}$

$\begin{align}
 P(defaulted=no|(married=yes \cap diet=vegan)) & \propto P(married=yes | defaulted = no) * P(diet=vegan | defaulted = no) * P(defaulted=no)\\
 & = 0.666 * \rho * 0.5 \\
 & = 0.333\rho
\end{align}$

Here $\rho$ stands in for $P(diet=vegan | defaulted=no)$, which is zero in our training data.  We will see how to actually deal with zero conditional probabilities later, but for now let's just say that $\rho=0.333$.  The (unnormalized) posterior for defaulted=yes is then 0.055 and for defaulted=no is 0.111.  Therefore, our model would predict that a married vegan would not default.
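
The hand calculation can be cross-checked with a few lines of plain Python.  This is a sketch under the same assumptions as the text, not the notebook's implementation: the helper `conditional` and its `fallback` argument are hypothetical, and `rho` is the same placeholder value of 1/3 used above for the zero conditional.

```python
from collections import Counter
from fractions import Fraction

# (married, diet, defaulted) rows copied from the borrower table.
rows = [
    ("yes", "conventional", "no"),
    ("no", "conventional", "yes"),
    ("no", "vegan", "yes"),
    ("yes", "vegetarian", "no"),
    ("no", "gluten-free", "no"),
    ("yes", "West Coast", "yes"),
]

# Priors: proportion of training records in each class.
class_counts = Counter(label for *_, label in rows)
priors = {c: Fraction(k, len(rows)) for c, k in class_counts.items()}


def conditional(attr_index, value, label, fallback=None):
    """Relative frequency of value among the rows of the given class.

    fallback plays the role of the text's rho when the count is zero.
    """
    n_c = sum(1 for r in rows if r[attr_index] == value and r[-1] == label)
    if n_c == 0 and fallback is not None:
        return fallback
    return Fraction(n_c, class_counts[label])


rho = Fraction(1, 3)  # placeholder for the zero conditional, as in the text

# Unnormalized posterior for a married vegan, one score per class.
scores = {
    label: priors[label]
    * conditional(0, "yes", label)
    * conditional(1, "vegan", label, fallback=rho)
    for label in priors
}
prediction = max(scores, key=scores.get)
print(scores, prediction)
```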


### Edge cases

#### Conditionals of zero
As we saw in our example above, a zero conditional can occur when there is too little data (esp. with regard to a single attribute).  Since a zero conditional will zero out the product of conditionals, it is important to come up with a heuristic for dealing with them. A common approach is to use the m-estimate:

$$P(x_i|y_i)=\frac{n_c+mp}{n+m}$$

where $m$ (the equivalent sample size) and $p$ are parameters, $n_c$ is the number of records with attribute value $x_i$ and class $y_i$, and $n$ is the number of records with class $y_i$.  We note that if there are no training records for class $y_i$ (so $n = n_c = 0$), then $P(x_i|y_i)=p$, so $p$ can be conceived of as the prior probability of attribute value $x_i$ given class $y_i$.  As $n$ grows, the estimate approaches the observed proportion $n_c/n$.
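
The formula is short enough to sketch directly.  The function name `m_estimate` is illustrative, and the default values $m=1$ and $p=0.1$ are assumptions chosen to mirror the constructor defaults of the implementation further down.

```python
def m_estimate(n_c, n, m=1.0, p=0.1):
    """m-estimate of P(x | y): (n_c + m * p) / (n + m)."""
    if n + m == 0:
        raise ValueError("n + m must be positive")
    return (n_c + m * p) / (n + m)


# A value never seen with a class of 3 records no longer gets probability 0.
zero_count = m_estimate(n_c=0, n=3)
# With no records at all for the class, the estimate falls back to p.
no_data = m_estimate(n_c=0, n=0)
print(zero_count, no_data)
```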

#### Estimating conditionals for continuous attributes
The astute student might notice that directly calculating conditionals for continuous variables is impractical, since any single value is unlikely to occur multiple times in our training set.  The simplest way to deal with this issue is discretization: divide up the range of values into regions, map the values to regions, and treat the region 'labels' as categorical variables.  The problem with this approach is that the analyst must select the regions, and a poor choice (too coarse or too fine) distorts the estimated conditionals.
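
A minimal sketch of discretization using only the standard library.  The cut points and region labels are hypothetical choices made for illustration; the salary values are the ones used in the example data in the implementation section.

```python
from bisect import bisect_right


def discretize(value, edges, labels):
    """Map a numeric value to the label of the region it falls in.

    edges are the interior cut points in ascending order, so
    len(labels) must equal len(edges) + 1.
    """
    if len(labels) != len(edges) + 1:
        raise ValueError("need exactly one more label than cut points")
    return labels[bisect_right(edges, value)]


# Hypothetical cut points for a salary attribute.
edges = [50_000, 100_000]
labels = ["low", "medium", "high"]
salaries = [120000.0, 34000.0, 54000.0, 75000.0, 90000.0, 65000.0]

binned = [discretize(s, edges, labels) for s in salaries]
print(binned)
```

Once binned, the attribute can be handled exactly like any other categorical attribute.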

Alternatively, we can assume that the data fits a specific distribution (e.g. Gaussian) and use the class-conditional probability density of that distribution as our conditional.  Assuming a Gaussian distribution, we have:

$$P(X_i=x|Y=y_j)=\frac{1}{\sqrt{2\pi}\sigma_{ij}}e^{-\frac{(x-\mu_{ij})^2}{2\sigma^2_{ij}}}$$

Consequently, during the training phase we calculate the sample mean and sample variance of the attribute $X_i$'s values for records with class $y_j$ and use those values to estimate $\mu_{ij}$ and $\sigma^2_{ij}$ respectively.  At that point, we have a function for determining the conditional for a particular attribute value and class.
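
The two steps (estimate the parameters, then evaluate the density) can be sketched with the standard library.  The function name `gaussian_pdf` is illustrative, and the three salaries are assumed to be the training values for a single class.

```python
import math
import statistics


def gaussian_pdf(x, mean, std):
    """Density of a normal distribution with the given mean and std at x."""
    exponent = -((x - mean) ** 2) / (2 * std ** 2)
    return math.exp(exponent) / (math.sqrt(2 * math.pi) * std)


# Assumed salaries of the training records that share one class.
class_salaries = [120000.0, 75000.0, 90000.0]

mu = statistics.mean(class_salaries)
sigma = statistics.stdev(class_salaries)  # sample std (n - 1 denominator)

density = gaussian_pdf(120000.0, mu, sigma)
print(mu, sigma, density)
```

Note that the result is a density rather than a probability, so it is not bounded by 1; it is still usable as the conditional because only the relative size of the class scores matters.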

## Implementation

In [23]:
#Assumptions : Only string (nominal values) or numeric fields
import pandas as pd
import math

class naive_bayes:
    
    def __init__(self, equivalent_sample_size=1, default_attr_prior=0.1):
        self._prior = default_attr_prior
        self._eq_sample_size = equivalent_sample_size
       
    
    def _m_cond(self, class_cnt, class_attr_cnt):
        return (class_attr_cnt + (self._eq_sample_size*self._prior)) / (class_cnt + self._eq_sample_size)


    def _calc_priors(self, labels):
        return labels.value_counts() / len(labels)

    
    def _categorical_conditionals(self, df):
        "Assumes 2 columns, attribute and label, and the attribute values are categorical"
        attr_col = df.columns[0]
        class_col = df.columns[1]

        conds = []
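        # Use the frame passed in (not the global `data`), and count every observed
        # attribute value in every class so zero counts receive an m-estimate too.
        values = df[attr_col].unique()
        for y, group in df.groupby(class_col):
            class_count = len(group)
            counts = group[attr_col].value_counts().reindex(values, fill_value=0)
            conds += [[val, self._m_cond(class_count, cnt), y] for val, cnt in counts.items()]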

        return pd.DataFrame(conds,columns=['Value','Probability','Class'])
    
    
    def _continuous_conditionals(self, df):
        "Assumes 2 columns, attribute and label, and the attribute values are numeric"
        attr_col = df.columns[0]
        class_col = df.columns[1]

        stats = []
        # Group the frame passed in rather than the global `data`.
        for y, group in df.groupby(class_col):
            stats += [[group[attr_col].mean(), group[attr_col].std(), y]]

        return pd.DataFrame(stats,columns=['Mean','Std','Class'])
    

    def fit(self, data,class_column):
        result = {}

        numeric_cols = (data.dtypes == 'float64') | (data.dtypes == 'int64')
        for col, is_numeric in numeric_cols.items():
            if col == class_column:
                continue
            pair = data[[col, class_column]]
            if is_numeric:
                result[col] = {'type' : 'continuous', 'conditionals' : self._continuous_conditionals(pair)}
            else:
                result[col] = {'type' : 'categorical', 'conditionals' : self._categorical_conditionals(pair)}

        self._model = result
        self._priors = self._calc_priors(data[class_column])


    def _gaussian(self, val, params):
        std = params['Std']
        return (1.0/(math.sqrt(2*math.pi) * std)) * math.exp(-((val-params['Mean'])**2)/(2*(std**2)))

    def _calc_posterior(self, record, col):
        model_params = self._model[col]
        x = record[col]
        conds = model_params['conditionals']

        if model_params['type'] == 'categorical':
            return conds[conds['Value'] == x][['Probability','Class']]

        #else continuous
        gauss = lambda stats : pd.Series([self._gaussian(x,stats), stats['Class']])
        tmp = conds.apply(gauss,axis=1)
        return tmp.rename(columns={0 : 'Probability', 1 : 'Class'})


    def predict(self, record):
        if not hasattr(self, '_model'):
            raise Exception('The model has not been fitted, prediction is impossible')
        
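        # Collect one (Probability, Class) frame per attribute and stack them.
        # pd.concat is used because DataFrame.append was removed in pandas 2.0.
        frames = [self._calc_posterior(record, col) for col in record.index]
        tmp_df = pd.concat(frames, ignore_index=True)
        tmp_df['Probability'] = tmp_df['Probability'].astype(float)

        # Product of the conditionals per class, times the class prior.
        posteriors = tmp_df.groupby('Class')['Probability'].prod() * self._priors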
        print(posteriors)
        
        return posteriors.idxmax()
    


In [25]:
data = pd.DataFrame({'Diet' : ['conventional','conventional','vegan','vegetarian','gluten-free','west-coast'],
                    'Married' : ['yes','no','no','yes','no','yes'],
                    'Salary' : [120000.0,34000.0,54000.0,75000.0,90000.0,65000.0],
                    'Default' : ['no','yes','yes','no','no','yes']})

nb = naive_bayes()

nb.fit(data,'Default')

new_record = pd.Series(['east-coast','no',50000.00],index=['Diet','Married','Salary'])

nb.predict(data.iloc[0,:-1])
nb.predict(data.iloc[1,:-1])
nb.predict(new_record)


no     6.930807e-07
yes    6.260629e-11
dtype: float64
no     1.902869e-08
yes    1.020828e-06
dtype: float64
no     3.479887e-07
yes    6.649849e-06
dtype: float64


'yes'