# Naive Bayes Classifiers

Naive Bayes classifiers are a collection of classification algorithms based on __Bayes' Theorem__:

$$P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}$$

Where $A$ and $B$ are events and $P(B) \neq 0$.
-  $P(A|B)$ is a conditional probability: the likelihood of event $A$ occurring given that $B$ is true.
-  $P(B|A)$ is also a conditional probability: the likelihood of event $B$ occurring given that $A$ is true.
-  $P(A)$ and $P(B)$ are the probabilities of observing $A$ and $B$ independently of each other; these are known as the marginal probabilities.
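
As a quick numeric illustration of the formula, the sketch below plugs made-up probabilities into Bayes' Theorem. The numbers and the helper name `bayes_posterior` are assumptions chosen for illustration; they do not come from any dataset.

```python
def bayes_posterior(p_b_given_a, p_a, p_b):
    """Return P(A|B) from P(B|A), P(A) and P(B) using Bayes' Theorem."""
    if p_b == 0:
        raise ValueError('P(B) must be non-zero')
    return p_b_given_a * p_a / p_b

# Hypothetical numbers: P(B|A) = 0.9, P(A) = 0.2, P(B) = 0.3
posterior = bayes_posterior(0.9, 0.2, 0.3)
print('P(A|B) = {0:.2f}'.format(posterior))
```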

Naive Bayes is not a single algorithm but a family of algorithms that share a common principle: every pair of features is assumed to be independent of each other, given the class value. This is a strong assumption, but it results in a fast and effective method.

The probability of a class value given a value of an attribute is called the conditional probability. By multiplying the conditional probabilities together for each attribute for a given class value, we have a probability of a data instance belonging to that class.

To make a prediction we can calculate probabilities of the instance belonging to each class and select the class value with the highest probability.
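
In symbols, this decision rule can be sketched as follows, where $y$ is a class value and $x_1, \dots, x_n$ are the attribute values of the instance (notation introduced here for illustration). The general form also weights each class by its prior $P(y)$; the implementation in this tutorial multiplies only the conditional terms.

$$\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y)$$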

## Predict the Onset of Diabetes

The test problem we will use in this tutorial is the Pima Indians Diabetes problem.

This problem comprises 768 observations of medical details for Pima Indian patients. The records describe instantaneous measurements taken from the patients, such as their age, the number of times pregnant and blood workup. All patients are women aged 21 or older. All attributes are numeric, and their units vary from attribute to attribute.

Each record has a class value that indicates whether the patient suffered an onset of diabetes within 5 years of when the measurements were taken (1) or not (0). This is a standard dataset that has been studied a lot in machine learning literature. A good prediction accuracy is 70%-76%.

Below is a sample from the _pima-indians-diabetes.data.csv_ file to get a sense of the data we will be working with:
```
 6,148,72,35,0,33.6,0.627,50,1
 1,85,66,29,0,26.6,0.351,31,0
 8,183,64,0,0,23.3,0.672,32,1
 1,89,66,23,94,28.1,0.167,21,0
 0,137,40,35,168,43.1,2.288,33,1
```

This tutorial is broken down into the following steps:
1. __Handle Data__: Load the data from the CSV file and split it into training and test datasets.
2. __Summarize Data__: Summarize the properties in the training dataset so that we can calculate probabilities and make predictions.
3. __Make a Prediction__: Use the summaries of the dataset to generate a single prediction.
4. __Make Predictions__: Generate predictions given a test dataset and a summarized training dataset.
5. __Evaluate Accuracy__: Evaluate the accuracy of predictions made for a test dataset as the percentage correct out of all predictions made.
6. __Tie It Together__: Use all of the code elements to present a complete and standalone implementation of the Naive Bayes algorithm.

### 1. Handle Data

The first thing we need to do is to read our data file. The data is in CSV format without a header line or any quotes. We also need to convert the attributes that were loaded as strings into numbers so that we can work with them. Below is the __read_csv()__ function for loading the Pima Indians dataset.

In [1]:
def read_csv(filename):
    with open(filename) as f:
        dataset = [[float(x) for x in line.split(',')] for line in f]
    return dataset

We can test this function by loading the Pima Indians dataset and printing the number of data instances that were loaded.

In [2]:
filename = 'pima-indians-diabetes.data.csv'
dataset = read_csv(filename)
print('Loaded data file "{0}" with {1} rows'.format(filename, len(dataset)))

Next we need to split the data into a training dataset that Naive Bayes can use to make predictions and a test dataset that we can use to evaluate the accuracy of the model. We need to split the dataset randomly into train and test datasets with a ratio of 67% train and 33% test (this is a common ratio for testing an algorithm on a dataset).

There's already a similar function given by the __scikit-learn__ library but for our purposes we're going to implement our own function. It's not very complicated, trust me.

Below is the __train_test_split()__ function that will split a given dataset into a given split ratio.

In [3]:
import random

def train_test_split(dataset, split_ratio):
    train_size = int(len(dataset) * split_ratio)
    train = []
    test = list(dataset)
    while len(train) < train_size:
        index = random.randrange(len(test))
        train.append(test.pop(index))
    return train, test

We can test this out by defining a mock dataset with 5 instances, splitting it into training and test datasets, and printing them out to see which data instances ended up where.

In [4]:
dataset = [[1], [2], [3], [4], [5]]
split_ratio = 0.67
train, test = train_test_split(dataset, split_ratio)
print('Split {0} rows into train with {1} and test with {2}'.format(len(dataset), train, test))
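
Because the split is random, each run puts different instances in each set. If a repeatable split is wanted, one option is to seed a private random generator. The sketch below is a variant under that assumption: the name `seeded_train_test_split` and its `seed` parameter are hypothetical, and it shuffles a copy of the data and slices it rather than popping random indices.

```python
import random

def seeded_train_test_split(dataset, split_ratio, seed=None):
    """Split dataset into train and test lists, optionally reproducibly."""
    rng = random.Random(seed)  # private generator, global random state is untouched
    train_size = int(len(dataset) * split_ratio)
    shuffled = list(dataset)   # copy so the caller's list is not reordered
    rng.shuffle(shuffled)
    return shuffled[:train_size], shuffled[train_size:]

mock_dataset = [[1], [2], [3], [4], [5]]
train_a, test_a = seeded_train_test_split(mock_dataset, 0.67, seed=42)
train_b, test_b = seeded_train_test_split(mock_dataset, 0.67, seed=42)
print(train_a == train_b and test_a == test_b)  # same seed, same split
```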

### 2. Summarize Data

The Naive Bayes model is comprised of a summary of the data in the training dataset. This summary is then used when making predictions.

The summary of the training data consists of the mean and the standard deviation of each attribute, by class value. For example, if there are 2 class values and 7 numerical attributes, then we need a mean and standard deviation for each combination of attribute (7) and class value (2), that is, 14 attribute summaries.

These are required when making predictions to calculate the probability of specific attribute values belonging to each class value.

We can break the preparation of this summary data down into the following sub-tasks:
1. Separate Data By Class
2. Calculate Mean
3. Calculate Standard Deviation
4. Summarize Dataset
5. Summarize Attributes By Class

#### Separate Data By Class

The first task is to separate the training dataset instances by class value so that we can calculate statistics for each class. We can do that by creating a map of each class value to a list of instances that belong to that class and sort the entire dataset of instances into the appropriate lists.

The __separate_by_class()__ function below does just this.

In [5]:
def separate_by_class(dataset):
    separated = {}
    for data in dataset:
        if data[-1] not in separated:
            separated[data[-1]] = []
        separated[data[-1]].append(data)
    return separated

You can see that the function assumes that the last attribute (-1) is the class value. The function returns a map of class values to lists of data instances.

We can test this function with some sample data, as follows:

In [6]:
dataset = [[1, 20, 1], [2, 21, 0], [3, 22, 1]]
separated = separate_by_class(dataset)
print('Separated instances: {0}'.format(separated))

#### Calculate Mean

We need to calculate the mean of each attribute for a class value. The mean is the central tendency of the data, and we will use it as the middle of our Gaussian distribution when calculating probabilities.

We also need to calculate the standard deviation of each attribute for a class value. The standard deviation describes the variation or spread of the data, and we will use it to characterize the expected spread of each attribute in our Gaussian distribution when calculating probabilities.

The standard deviation is calculated as the square root of the variance. The variance is calculated as the average of the squared differences for each attribute value from the mean. Note we are using the $N-1$ method, which subtracts 1 from the number of attribute values when calculating the variance.

In [7]:
import math

def mean(numbers):
    return sum(numbers) / float(len(numbers))

def stdev(numbers):
    avg = mean(numbers)
    variance = sum([(x - avg)**2 for x in numbers]) / float(len(numbers) - 1)
    return math.sqrt(variance)

We can test this by taking the mean and standard deviation of the numbers from 1 to 5.

In [8]:
numbers = [1, 2, 3, 4, 5]
print('Summary of {0}: mean = {1}, stdev = {2}'.format(numbers, mean(numbers), stdev(numbers)))
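
As a sanity check on the $N-1$ method, the same numbers can be compared against Python's standard `statistics` module, whose `stdev()` also computes the sample standard deviation. This is only a cross-check sketch; the variable names are illustrative.

```python
import math
import statistics

values = [1, 2, 3, 4, 5]

# Recompute by hand, dividing by N - 1 as described above
avg = sum(values) / float(len(values))
variance = sum((x - avg) ** 2 for x in values) / float(len(values) - 1)
sample_stdev = math.sqrt(variance)

print('manual:     mean = {0}, stdev = {1}'.format(avg, sample_stdev))
print('statistics: mean = {0}, stdev = {1}'.format(
    statistics.mean(values), statistics.stdev(values)))
```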

#### Summarize Dataset

Now we have the tools to summarize a dataset. For a given list of instances (for a class value) we can calculate the mean and the standard deviation for each attribute.

The zip function groups the values for each attribute across our data instances into their own lists so that we can compute the mean and standard deviation values for the attribute.

In [9]:
def summarize(dataset):
    summaries = [(mean(attribute), stdev(attribute)) for attribute in zip(*dataset)]
    del summaries[-1]
    return summaries
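
The column grouping that `zip(*dataset)` performs can be seen in isolation with a tiny example. Note that the last column holds the class values, which is why `summarize()` deletes the final summary.

```python
rows = [[1, 20, 0], [2, 21, 1], [3, 22, 0]]

# Unpacking the rows into zip() regroups the values by column
columns = list(zip(*rows))
print(columns)  # [(1, 2, 3), (20, 21, 22), (0, 1, 0)]
```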

We can test this __summarize()__ function with some test data that shows markedly different mean and standard deviation values for the first and second data attributes.

In [10]:
dataset = [[1, 20, 0], [2, 21, 1], [3, 22, 0]]
summary = summarize(dataset)
print('Attribute summaries: {0}'.format(summary))

#### Summarize Attributes By Class

We can pull it all together by first separating our training dataset into instances grouped by class. Then calculate the summaries for each attribute.

In [11]:
def summarize_by_class(dataset):
    separated = separate_by_class(dataset)
    summaries = {}
    for class_value, instances in separated.items():
        summaries[class_value] = summarize(instances)
    return summaries

We can test this __summarize_by_class()__ function with a small test dataset.

In [12]:
dataset = [[1, 20, 1], [2, 21, 0], [3, 22, 1], [4, 22, 0]]
summary = summarize_by_class(dataset)
print('Summary by class value: \n{0}'.format(summary))

Summary by class value: 
{1: [(2.0, 1.4142135623730951), (21.0, 1.4142135623730951)], 0: [(3.0, 1.4142135623730951), (21.5, 0.7071067811865476)]}


### 3. Make a Prediction

We are now ready to make predictions using the summaries prepared from our training data. Making predictions involves calculating the probability that a given data instance belongs to each class, then selecting the class with the largest probability as the prediction.

We can divide this part into the following tasks:
1. Calculate Gaussian Probability Density Function
2. Calculate Class Probabilities
3. Make a Prediction
4. Estimate Accuracy

#### Calculate Gaussian Probability Density Function

We can use a Gaussian function to estimate the probability of a given attribute value, given the known mean and standard deviation for the attribute estimated from the training data.

Given that the attribute summaries were prepared for each attribute and class value, the result is the conditional probability of a given attribute value given a class value.

In summary, we are plugging our known details into the Gaussian (attribute value, mean and standard deviation) and reading off the likelihood that our value belongs to the class. We can do this manually, or use the __st.norm.pdf()__ function provided by the __scipy.stats__ library.
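
Written out, the manual calculation evaluates the standard Gaussian probability density function, where $x$ is the attribute value, $\mu$ the mean and $\sigma$ the standard deviation (the symbols $\mu$ and $\sigma$ are notation introduced here). Strictly speaking the result is a density rather than a probability, so it can be greater than 1 when $\sigma$ is small.

$$f(x \mid \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$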

In [13]:
import math
import scipy.stats as st

# Manual calculation
def calculate_probability(x, mean, stdev):
    exponent = math.exp(-(x - mean)**2 / (2 * stdev**2))
    return exponent / (math.sqrt(2 * math.pi) * stdev)

# Using scipy.stats (when this cell runs, this definition replaces the manual one above)
def calculate_probability(x, mean, stdev):
    return st.norm.pdf(x, loc = mean, scale = stdev)

We can test this with some sample data, as follows.

In [14]:
x = 71.5
# Named mu and sigma so the mean() and stdev() functions defined earlier are not overwritten
mu = 73
sigma = 6.2
probability = calculate_probability(x, mu, sigma)
print('Probability of belonging to this class: {0}'.format(probability))

Probability of belonging to this class: 0.06248965759370005


#### Calculate Class Probabilities

Now that we can calculate the probability of an attribute value given a class, we can combine the probabilities of all of the attribute values for a data instance and come up with a probability of the entire data instance belonging to the class.

We combine probabilities by multiplying them. In the __calculate_class_probabilities()__ function below, the probability of a given data instance is calculated by multiplying together the attribute probabilities for each class. The result is a map of class values to probabilities.

In [15]:
def calculate_class_probabilities(summaries, input_vector):
    probabilities = {}
    for class_value, class_summaries in summaries.items():
        probabilities[class_value] = 1
        for i in range(len(class_summaries)):
            mean, stdev = class_summaries[i]
            x = input_vector[i]
            probabilities[class_value] *= calculate_probability(x, mean, stdev)
    return probabilities

We can test the __calculate_class_probabilities()__ function.

In [17]:
summaries = {0: [(1, 0.5)], 1: [(20, 5.0)]}
input_vector = [1.1, '?']
probabilities = calculate_class_probabilities(summaries, input_vector)
print('Probabilities for each class:\n{0}'.format(probabilities))

Probabilities for each class:
{0: 0.7820853879509118, 1: 6.298736258150437e-05}


#### Make a Prediction

Now that we can calculate the probability of a data instance belonging to each class value, we can look for the largest probability and return the associated class.

The __predict()__ function below does just that.

In [20]:
def predict(summaries, input_vector):
    probabilities = calculate_class_probabilities(summaries, input_vector)
    _, best_label = max((probability, class_value) for class_value, probability in probabilities.items())
    return best_label

We can test the __predict()__ function as follows:

In [22]:
summaries = {'A': [(1, 0.5)], 'B': [(20, 5.0)]}
input_vector = [1.1, '?']
prediction = predict(summaries, input_vector)
print('Prediction: {0}'.format(prediction))

Prediction: A


### 4. Make Predictions

Finally, we can estimate the accuracy of the model by making predictions for each data instance in our test dataset. The __get_predictions()__ function will do this and return a list of predictions for each test instance.

In [23]:
def get_predictions(summaries, test_set):
    predictions = [predict(summaries, input_vector) for input_vector in test_set]
    return predictions

We can test the __get_predictions()__ function.

In [24]:
summaries = {'A': [(1, 0.5)], 'B': [(20, 5.0)]}
test_set = [[1.1, '?'], [19.1, '?']]
predictions = get_predictions(summaries, test_set)
print('Predictions: {0}'.format(predictions))

Predictions: ['A', 'B']


### 5. Get Accuracy

The predictions can be compared to the class values in the test dataset, and the classification accuracy can be calculated as a percentage between 0% and 100%. The __get_accuracy()__ function will calculate this percentage.

In [25]:
def get_accuracy(test_set, predictions):
    correct = 0
    for i in range(len(test_set)):
        if test_set[i][-1] == predictions[i]:
            correct += 1
    return correct / float(len(test_set)) * 100.0

In [26]:
test_set = [[1,1,1,'a'], [2,2,2,'a'], [3,3,3,'b']]
predictions = ['a', 'a', 'a']
accuracy = get_accuracy(test_set, predictions)
print('Accuracy: {0}'.format(accuracy))

Accuracy: 66.66666666666666


### 6. Tie It Together

Finally, we need to tie it all together.

Below is the full code listing for Naive Bayes implemented from scratch in Python.

In [31]:
# Example of Naive Bayes implemented from scratch in Python
import random
import math

def read_csv(filename):
    with open(filename) as f:
        dataset = [[float(x) for x in line.split(',')] for line in f]
    return dataset

def train_test_split(dataset, split_ratio):
    train_size = int(len(dataset) * split_ratio)
    train = []
    test = list(dataset)
    while len(train) < train_size:
        index = random.randrange(len(test))
        train.append(test.pop(index))
    return train, test

def separate_by_class(dataset):
    separated = {}
    for data in dataset:
        if data[-1] not in separated:
            separated[data[-1]] = []
        separated[data[-1]].append(data)
    return separated

def mean(numbers):
    return sum(numbers) / float(len(numbers))

def stdev(numbers):
    avg = mean(numbers)
    variance = sum([(x - avg)**2 for x in numbers]) / float(len(numbers) - 1)
    return math.sqrt(variance)

def summarize(dataset):
    summaries = [(mean(attribute), stdev(attribute)) for attribute in zip(*dataset)]
    del summaries[-1]
    return summaries

def summarize_by_class(dataset):
    separated = separate_by_class(dataset)
    summaries = {}
    for class_value, instances in separated.items():
        summaries[class_value] = summarize(instances)
    return summaries

def calculate_probability(x, mean, stdev):
    exponent = math.exp(-(x - mean)**2 / (2 * stdev**2))
    return exponent / (math.sqrt(2 * math.pi) * stdev)

def calculate_class_probabilities(summaries, input_vector):
    probabilities = {}
    for class_value, class_summaries in summaries.items():
        probabilities[class_value] = 1
        for i in range(len(class_summaries)):
            mean, stdev = class_summaries[i]
            x = input_vector[i]
            probabilities[class_value] *= calculate_probability(x, mean, stdev)
    return probabilities

def predict(summaries, input_vector):
    probabilities = calculate_class_probabilities(summaries, input_vector)
    _, best_label = max((probability, class_value) for class_value, probability in probabilities.items())
    return best_label

def get_predictions(summaries, test_set):
    predictions = [predict(summaries, input_vector) for input_vector in test_set]
    return predictions

def get_accuracy(test_set, predictions):
    correct = 0
    for i in range(len(test_set)):
        if test_set[i][-1] == predictions[i]:
            correct += 1
    return correct / float(len(test_set)) * 100.0

if __name__ == '__main__':
    dataset = read_csv('pima-indians-diabetes.data.csv')
    train, test = train_test_split(dataset, 0.67)
    print('Split %d rows into train = %d and test = %d rows'
          % (len(dataset), len(train), len(test)))
    
    # Prepare model
    summaries = summarize_by_class(train)
    
    # Test model
    predictions = get_predictions(summaries, test)
    accuracy = get_accuracy(test, predictions)
    print('Accuracy: {0}%'.format(accuracy))

Split 768 rows into train = 514 and test = 254 rows
Accuracy: 76.77165354330708%


## Implementation Extensions

This section provides you with ideas for extensions that you could apply and investigate with the Python code you have implemented as part of this tutorial.

You have implemented your own version of Gaussian Naive Bayes in Python from scratch.

You can extend the implementation further.

-  __Calculate Class Probabilities__: Update the example to summarize the probabilities of a data instance belonging to each class as a ratio. This can be calculated as the probability of a data instance belonging to one class, divided by the sum of the probabilities of the data instance belonging to each class. For example, if an instance had a probability of 0.02 for class A and 0.001 for class B, the likelihood of the instance belonging to class A is 0.02/(0.02 + 0.001) * 100, which is about 95.24%.
-  __Log Probabilities__: The conditional probabilities of each attribute value given a class are often small. When they are multiplied together, they result in very small values, which can lead to floating point underflow (numbers too small to represent in Python). A common fix for this is to add the logs of the probabilities together instead. Research and implement this improvement.
-  __Nominal Attributes__: Update the implementation to support nominal attributes. This is very similar, and the summary information you can collect for each attribute is the ratio of category values for each class.
-  __Different Density Function__ (_Bernoulli_ or _multinomial_): We have looked at Gaussian Naive Bayes, but you can also look at other distributions. Implement a different distribution such as _multinomial_, _Bernoulli_ or _kernel naive Bayes_ that makes different assumptions about the distribution of attribute values and/or their relationship with the class value.
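
As a starting point for the first two extensions, the sketch below sums log densities instead of multiplying densities, then converts the log scores into ratios that sum to 1. It is one possible approach, not the only one: the function names `gaussian_log_pdf`, `calculate_class_log_probabilities` and `normalize_log_probabilities` are hypothetical, and the `summaries` layout is assumed to match the one produced by `summarize_by_class()`.

```python
import math

def gaussian_log_pdf(x, mean, stdev):
    """Natural log of the Gaussian density at x."""
    return (-((x - mean) ** 2) / (2 * stdev ** 2)
            - math.log(math.sqrt(2 * math.pi) * stdev))

def calculate_class_log_probabilities(summaries, input_vector):
    """Sum log densities per class instead of multiplying densities."""
    log_probabilities = {}
    for class_value, class_summaries in summaries.items():
        total = 0.0
        for i, (mean, stdev) in enumerate(class_summaries):
            total += gaussian_log_pdf(input_vector[i], mean, stdev)
        log_probabilities[class_value] = total
    return log_probabilities

def normalize_log_probabilities(log_probabilities):
    """Turn log scores into ratios that sum to 1."""
    # Subtracting the largest log score before exponentiating avoids underflow
    highest = max(log_probabilities.values())
    shifted = {c: math.exp(lp - highest) for c, lp in log_probabilities.items()}
    total = sum(shifted.values())
    return {c: value / total for c, value in shifted.items()}

summaries = {'A': [(1, 0.5)], 'B': [(20, 5.0)]}
log_probabilities = calculate_class_log_probabilities(summaries, [1.1, '?'])
ratios = normalize_log_probabilities(log_probabilities)
print('Prediction: {0}'.format(max(log_probabilities, key=log_probabilities.get)))
print('Class ratios: {0}'.format(ratios))
```

Because the logarithm is monotonic, the class with the largest log score is the same class that the product of densities would select.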