# Naive Bayes

### Bayes' Theorem
* Think about making a prediction given a certain set of information.
* You have the **Prior** belief, which is what you would assume before seeing the data. Typically, this is the proportion of training samples that belong to a given category.
* You have the **Likelihood** function, which is the probability of observing the data given the category. This is separate from the posterior, which is the probability of the category given the data.
* You have the **Normalization**, which is how likely you are to observe the data at all. You divide by it so that the result is a proper probability conditioned on ("given") the data.

>$P(Y|X) = \frac{P(X|Y)P(Y)}{P(X)}$

0. $P(Y|X) = \text{Posterior}$
1. $P(Y) = \text{Prior}$
2. $P(X|Y) = \text{Likelihood}$
3. $P(X) = \text{Normalization}$
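
As a quick numeric illustration of the formula above, here is a minimal sketch with two classes. The class names and all of the probabilities are made up for the example; only the arithmetic follows Bayes' theorem.

```python
# Hypothetical toy numbers: two classes with made-up priors and likelihoods
priors = {"spam": 0.3, "ham": 0.7}        # P(Y)
likelihoods = {"spam": 0.8, "ham": 0.1}   # P(X | Y) for one observed X

# Normalization P(X), obtained by summing P(X | Y) * P(Y) over the classes
evidence = sum(priors[c] * likelihoods[c] for c in priors)

# Posterior P(Y | X) for each class
posteriors = {c: priors[c] * likelihoods[c] / evidence for c in priors}
print(posteriors)
```

Because of the division by the normalization, the posteriors sum to 1 across the classes.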

### The "Naive" assumption
* To estimate $P(X|Y)$ exactly, you would need to have already observed cases with the exact combination of feature values in $X$. That is unlikely when the data has many features, each with many possible values.
* Instead, you can naively assume that the features are independent of one another given the category. The likelihood can then be expressed as the simple product of the per-feature probabilities.
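
Written out, the naive assumption amounts to the factorization below. The symbols $x_1, \dots, x_n$ (the individual features of $X$) and $n$ (the number of features) are notation introduced here for illustration.

>$P(X|Y) = P(x_1, \dots, x_n|Y) \approx \prod_{i=1}^{n} P(x_i|Y)$

Since $P(X)$ is the same for every category, the predicted category is the one that maximizes $P(Y)\prod_{i=1}^{n} P(x_i|Y)$.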

### Naive Bayes on a Natural Language Processing (NLP) Example
* While Naive Bayes is a good general-purpose model, it is especially common in natural language processing.
* With a bag-of-words representation, the position of a word is ignored, so each position where a word could appear is not treated as a separate feature.
* Therefore, when calculating word probabilities in NLP, you pool the counts from every position in the text rather than only the counts at one particular position.
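
The effect of ignoring word position can be sketched with the standard library. The two sentences are hypothetical; the point is only that they produce identical word counts.

```python
from collections import Counter

# Two hypothetical sentences with the same words in a different order
doc_a = "the market rose and the dollar fell".split()
doc_b = "the dollar fell and the market rose".split()

# A bag of words keeps only how often each word occurs
bag_a = Counter(doc_a)
bag_b = Counter(doc_b)

print(bag_a)
print(bag_a == bag_b)
```

A model that only sees these counts cannot tell the two sentences apart, which is the price paid for the simpler feature space.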

<center>Resources</center>

https://en.wikipedia.org/wiki/Additive_smoothing

https://stats.stackexchange.com/questions/218492/how-does-naive-bayes-work-with-continuous-variables

http://lib.stat.cmu.edu/datasets/boston

### Initial Model
* Ended up being too slow because the training-set statistics are recomputed every time a prediction is made

In [46]:
import numpy as np
import tensorflow as tf
import tensorflow.keras as keras
import sklearn as skl

# Additive smoothing (add-one smoothing, aka Laplace smoothing)
def laplace_smoothing(Num_trials_considered, fulfilling_x, Num_possibilities_x, a=1):
    """
    Laplace smoothing pretends that you observed `a` extra samples (one by default) for each value that a feature could take. For example, to calculate P(x=1), you add 1 to the count of every value that x could take on, so no value ends up with a probability of exactly zero.
    
    Parameters
    ----------
    Num_trials_considered : int
        number of trials being looked at (same as the number of observations), restricted to a particular y value. Can be expressed as len(y[y==y_intended])
    fulfilling_x : int
        the number of trials (out of Num_trials_considered) where the specified feature attains a specific value. The sum of fulfilling_x over all possible values of the specified feature should equal Num_trials_considered.
    Num_possibilities_x : int
        number of possibilities that the specified feature could take on.
    a : int, default=1
        the addition constant. Represents the number of extra observations added to each value of a feature.
    """
    # Typically we would just take x/N
    lp = (fulfilling_x + a) / (Num_trials_considered + a * Num_possibilities_x)
    
    return lp
# Naive bayes expressed in code
def naive_bayes(y, x, categories, data, lp_smoothing = True):
    y = np.array(y)
    x = np.array(x)
    
    posteriors = {}
    norm_likelihood_all = []
    prior_all = []
    
    num_samples = x.shape[0]
    num_features = x.shape[1]
    
    for category in categories:
        
        given_cat = (y == category)
        if lp_smoothing:
            prior = laplace_smoothing(Num_trials_considered = num_samples, fulfilling_x = np.count_nonzero(given_cat), Num_possibilities_x = len(categories), a = 1)
        else:
            prior = np.count_nonzero(given_cat)/len(y)
        
        x_given_cat = x[given_cat]
        norm_likelihood = 1
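        for i in range(len(data)):
            feature_likelihood = x_given_cat[:, i]
            feature_normalization = x[:, i]
            if lp_smoothing:
                # Smoothed P(x_i | y) divided by smoothed P(x_i).
                # The hard-coded 10000 and num_features stand in for the number of values a feature can take.
                likelihood_i = laplace_smoothing(Num_trials_considered = np.count_nonzero(given_cat), fulfilling_x = np.count_nonzero(feature_likelihood == data[i]), Num_possibilities_x = 10000, a = 1)
                normalization_i = laplace_smoothing(Num_trials_considered = num_samples, fulfilling_x = np.count_nonzero(feature_normalization == data[i]), Num_possibilities_x = num_features, a = 1)
                norm_likelihood *= likelihood_i / normalization_i
            else:
                # Unsmoothed P(x_i | y) / P(x_i), with each ratio computed separately so the grouping is explicit
                likelihood_i = np.count_nonzero(feature_likelihood == data[i]) / np.count_nonzero(given_cat)
                normalization_i = np.count_nonzero(feature_normalization == data[i]) / num_samples
                norm_likelihood *= likelihood_i / normalization_i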
                
        prior_all.append(prior)
        norm_likelihood_all.append(norm_likelihood)
        
        posteriors[category] = prior*norm_likelihood
    return (posteriors, prior_all, norm_likelihood_all)
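
To see what the smoothing does, here is a minimal standalone sketch with made-up counts. `smoothed` is a hypothetical helper that applies the same formula as `laplace_smoothing`, and the feature values and counts are invented for the example.

```python
def smoothed(count, total, num_values, a=1):
    # Same formula as laplace_smoothing: (count + a) / (total + a * num_values)
    return (count + a) / (total + a * num_values)

# Hypothetical feature with 3 possible values, observed 10 times within one class
counts = {"low": 7, "mid": 3, "high": 0}
total = sum(counts.values())

# Plain relative frequencies: "high" gets probability 0
raw = {v: c / total for v, c in counts.items()}

# Smoothed estimates: every value gets a nonzero probability
smooth = {v: smoothed(c, total, len(counts)) for v, c in counts.items()}

print(raw)
print(smooth)
```

Without smoothing, a single unseen value would multiply the whole likelihood by zero; with it, the estimates still sum to 1 but none of them is zero.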

### Class Model

In [92]:
# New, memory efficient version

class naive_b():
        
    def train(self,x_train,y_train,verbose = True):
        assert x_train.ndim == 2
        
        vocab_size = x_train.shape[1]
        # Assume classes are sequential integers starting at 0
        num_classes = int(max(y_train)) + 1
        
        # Overall frequency of each word across the whole training set
        normalizations = np.sum(x_train, axis = 0)
        normalizations = normalizations / np.sum(normalizations)
        
        priors = np.zeros(num_classes,)
        likelihoods = np.zeros((num_classes, vocab_size))
        
        for i in range(num_classes):
            y_class = (y_train == i)
            class_count = np.count_nonzero(y_class)
            priors[i,] = class_count/len(y_train)
            
            x_class = x_train[y_class]
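            # Add-one smoothing. Note that the denominator uses the number of documents in the class,
            # whereas the standard multinomial estimate divides by the total word count in the class plus vocab_size.
            likelihoods[i,:] = (np.sum(x_class, axis = 0) + 1) / (class_count + vocab_size)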
        self.priors = priors
        self.likelihoods = likelihoods
        self.normalizations = normalizations
    
    def test(self,x_test):
        posteriors_all = []
        preds = []
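        for i in range(len(x_test)):
            # Multiply the likelihoods of every word present in the document (word counts are ignored here)
            likelihood = np.prod(self.likelihoods[:, x_test[i] != 0], axis = 1)
            # Product of (overall frequency * count) over the words present.
            # It is the same for every class, so it does not change which class wins the argmax.
            normalization = (self.normalizations*x_test[i])
            normalization = np.prod(normalization[normalization != 0])
            posteriors = (self.priors * likelihood) / normalization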
            
            posteriors_all.append(posteriors)
            preds.append(np.argmax(posteriors))
        return preds, posteriors_all
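
For comparison, the textbook multinomial estimate divides by the total number of word occurrences in the class rather than by the number of documents. The sketch below shows that variant on a made-up count matrix; it is not the method used in `naive_b`, and the data is hypothetical.

```python
import numpy as np

# Hypothetical word counts for the documents of one class (3 documents, vocabulary of 4)
x_class = np.array([[2, 0, 1, 0],
                    [1, 1, 0, 0],
                    [3, 0, 0, 1]])
vocab_size = x_class.shape[1]

# Occurrences of each word across all documents in the class
word_totals = x_class.sum(axis=0)

# Add-one smoothed estimate of P(word | class); the denominator is the total word count plus vocab_size
likelihood = (word_totals + 1) / (word_totals.sum() + vocab_size)
print(likelihood)
```

With this denominator the per-word likelihoods of a class sum to 1, so they form a proper distribution over the vocabulary.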

In [77]:
# Trying to use Naive Bayes with the Reuters dataset

from tensorflow.keras.datasets import reuters
from sklearn.feature_extraction.text import CountVectorizer

vocab_size = 500

(train_data, train_labels), (test_data, test_labels) = reuters.load_data(num_words = vocab_size)

print('Currently the data comes in the form of variable length lists (e.g., below). We want to make everything the same length to make it computer friendly')
print('len ex. 1:', len(train_data[0]), 'len ex. 2:', len(train_data[1]))

Currently the data comes in the form of variable length lists (e.g., below). We want to make everything the same length to make it computer friendly
len ex. 1: 87 len ex. 2: 56


In [122]:
# An example of what a simple vectorizer might do
# def vectorizer(sequences, dimension=500):
#     # Convert lists of different sizes to boolean vectors with length equal to the total number of words
#     vectorized = np.zeros((len(sequences), dimension))
#     for i, sequence in enumerate(sequences):
#         vectorized[i, sequence] = 1
#     return vectorized

def count_vectorizer(list_of_lists, vocab_size):
    vectorized_sequences= []
    for sequence in list_of_lists:
        unique, counts = np.unique(np.array(sequence), return_counts = True)
        vectorized = np.zeros(vocab_size)
        vectorized[unique] = counts
        vectorized_sequences.append(vectorized)
    out = np.stack(vectorized_sequences, axis = 0)
    return out

x_train = count_vectorizer(train_data, vocab_size)
x_test = count_vectorizer(test_data, vocab_size)
y_train = np.array(train_labels)
y_test = np.array(test_labels)

print(x_train.shape, x_test[0].shape)

(8982, 500) (500,)
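
To make the vectorization step concrete, here is a small sketch on toy input. It uses `np.bincount` instead of `np.unique`, and the three "documents" and the vocabulary size of 5 are made up for the example.

```python
import numpy as np

# Hypothetical toy input: three "documents" as lists of word indices, vocabulary of 5
toy_docs = [[0, 2, 2, 4], [1, 1, 1], [3]]
toy_vocab_size = 5

# np.bincount counts how often each index occurs; minlength pads each row to the vocabulary size
toy_counts = np.stack([np.bincount(doc, minlength=toy_vocab_size) for doc in toy_docs])
print(toy_counts)
print(toy_counts.shape)
```

Each row has the same length regardless of how long the original document was, which is what lets the rows be stacked into a single matrix.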


In [121]:
model = naive_b()
model.train(x_train, y_train)
preds, posteriors = model.test(x_test)
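# Accuracy is well above the random-guess baseline
print('accuracy: ', np.count_nonzero(y_test == preds) / len(y_test))
# Note: max - min is one less than the number of classes, so this slightly overstates the baseline
print('random guess: ', 1/(max(y_train) - min(y_train)))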

accuracy:  0.5952804986642921
random guess:  0.022222222222222223




In [120]:
from sklearn.naive_bayes import MultinomialNB

sk_model = MultinomialNB()
sk_model.fit(x_train, y_train)
sk_preds = sk_model.predict(x_test)
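# Accuracy is well above the random-guess baseline
print('accuracy: ', np.count_nonzero(y_test == sk_preds) / len(y_test))
# Note: max - min is one less than the number of classes, so this slightly overstates the baseline
print('random guess: ', 1/(max(y_train) - min(y_train)))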

accuracy:  0.7141585040071238
random guess:  0.022222222222222223




### Boston Housing
* Discretizing the Boston housing dataset is hard because there are too few features

#### Boston Housing Dataset Data Labels
* CRIM     per capita crime rate by town
* ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
* INDUS    proportion of non-retail business acres per town
* CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
* NOX      nitric oxides concentration (parts per 10 million)
* RM       average number of rooms per dwelling
* AGE      proportion of owner-occupied units built prior to 1940
* DIS      weighted distances to five Boston employment centres
* RAD      index of accessibility to radial highways
* TAX      full-value property-tax rate per \$10,000
* PTRATIO  pupil-teacher ratio by town
* B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
* LSTAT    % lower status of the population
* MEDV     Median value of owner-occupied homes in \$1000's

In [124]:
from tensorflow.keras.datasets import boston_housing
import pandas as pd

(bh_x_train, bh_y_train), (bh_x_test, bh_y_test) = boston_housing.load_data()

print(bh_x_train.shape, bh_x_test.shape)

# Normalization routine
def z_score(x_train, x_test):
    # Note that all transformation on the data are based on the training data
    # The goal is to make sure that you don't accidentally train on any aspects of the test data
    mean = x_train.mean(axis = 0)
    std = x_train.std(axis = 0)
    
    x_train_norm = (x_train - mean) / std
    x_test_norm = (x_test - mean) / std
    
    return (x_train_norm, x_test_norm, mean, std)

bh_x_train, bh_x_test, mean, std = z_score(bh_x_train, bh_x_test)

(404, 13) (102, 13)


In [125]:
from sklearn.preprocessing import KBinsDiscretizer

descretizer_x = KBinsDiscretizer(n_bins = 3, encode = 'onehot-dense', strategy='quantile')

print(bh_x_train.shape)
bh_x_train = descretizer_x.fit_transform(bh_x_train)
bh_x_test = descretizer_x.transform(bh_x_test)

(404, 13)




In [115]:
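# This predictor has a much lower success rate because this dataset has far fewer features, and fewer categories within each feature, than the Reuters data.
# MEDV is also a continuous target, so few test values exactly match a predicted integer class, which likely hurts the score further.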

bh_model = naive_b()
bh_model.train(bh_x_train, bh_y_train)
bh_preds, bh_posteriors = bh_model.test(bh_x_test)
print(np.count_nonzero(bh_y_test == bh_preds) / len(bh_y_test))

0.0392156862745098


### 1. Type of Data
* Data that can be counted and turned into probabilities (e.g., word data)
  * Discrete data

### 2. Use Case
* Natural Language Processing
* Multiclass identification

### 3. Basic Concept
* Use Bayes' theorem to identify the probability of each example belonging to each class.
* The naive assumption is that the features are independent of one another given the class
* The class with the highest posterior probability is chosen as the prediction

### 4. Assumptions
* Words are independent of one another given the class
* Bag-of-words is effective

### 5. Application
* NLP is particularly well suited

### 6. Existing solutions
* scikit-learn (e.g., `sklearn.naive_bayes.MultinomialNB`)

### 7. Strengths and Weaknesses
#### Strengths
* Simple, easy to understand
* Given sufficient data, can be very effective

#### Weaknesses
* Can be highly memory intensive during training
* Needs special techniques to make sure underflows do not happen
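
One common technique for avoiding underflow is to add log-probabilities instead of multiplying probabilities. The sketch below uses made-up per-word likelihoods (400 words at 0.01 each) purely to show the effect.

```python
import math

# Hypothetical per-word likelihoods: 400 words, each with probability 0.01
word_probs = [0.01] * 400

# Multiplying directly underflows to 0.0 in double precision
direct = 1.0
for p in word_probs:
    direct *= p

# Summing logarithms keeps the same information in a representable range
log_score = sum(math.log(p) for p in word_probs)

print(direct)
print(log_score)
```

Because the logarithm is monotonic, comparing summed log-probabilities across classes picks the same class as comparing the products would, without the products ever being formed.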