# Naive Bayes Classifiers

Naive Bayes classifiers are a collection of classification algorithms based on __Bayes' Theorem__:

$$P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}$$

where $A$ and $B$ are events and $P(B) \neq 0$.
-  $P(A|B)$ is a conditional probability: the probability of event $A$ occurring given that $B$ is true.
-  $P(B|A)$ is also a conditional probability: the probability of event $B$ occurring given that $A$ is true.
-  $P(A)$ and $P(B)$ are the probabilities of observing $A$ and $B$ on their own, without reference to each other; these are known as marginal probabilities.
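
As a quick numeric check of the theorem, here is a minimal sketch. The numbers and the helper name `bayes_posterior` are made up for illustration and do not come from the dataset used later; $P(B)$ is obtained from the law of total probability.

```python
def bayes_posterior(likelihood, prior, evidence):
    """Return P(A|B) given P(B|A), P(A) and P(B)."""
    if evidence == 0:
        raise ValueError('P(B) must be non-zero')
    return likelihood * prior / evidence

# Hypothetical numbers, chosen only for illustration.
p_a = 0.01              # P(A): prior probability of event A
p_b_given_a = 0.9       # P(B|A): probability of B when A is true
p_b_given_not_a = 0.05  # P(B|not A): probability of B when A is false

# P(B) via the law of total probability.
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

p_a_given_b = bayes_posterior(p_b_given_a, p_a, p_b)
print('P(A|B) = {0:.3f}'.format(p_a_given_b))
```

Even with a high $P(B|A)$, the posterior $P(A|B)$ stays small here because the prior $P(A)$ is small.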

Naive Bayes is not a single algorithm but a family of algorithms that share a common principle: every feature is assumed to be independent of every other feature, given the class. This is a strong assumption, but it results in a fast and effective method.
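
Written out for a class $y$ and feature values $x_1, \dots, x_n$ (symbols introduced here only for illustration), the independence assumption says that the joint probability of the features given the class factorizes into one term per feature:

$$P(x_1, \dots, x_n | y) = \prod_{i=1}^{n} P(x_i | y)$$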

The probability of an attribute value given a class value is a conditional probability. Multiplying these conditional probabilities together for each attribute, for a given class value, gives a score for how likely a data instance is to belong to that class.

To make a prediction, we calculate this score for each class and select the class value with the highest one.
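
The prediction rule can be sketched in a few lines. The probabilities below are invented for illustration, and `class_scores` and `predict` are hypothetical helper names rather than part of the tutorial code. Following the description above, the sketch multiplies only the per-attribute conditional probabilities; a fuller treatment would also multiply in the class prior.

```python
# Hypothetical per-attribute conditional probabilities for one data instance,
# keyed by class value. Each list holds one probability per attribute.
conditionals = {
    0: [0.8, 0.3],
    1: [0.4, 0.7],
}

def class_scores(conditionals):
    """Multiply the conditional probabilities together for each class."""
    scores = {}
    for label, probs in conditionals.items():
        score = 1.0
        for p in probs:
            score *= p
        scores[label] = score
    return scores

def predict(scores):
    """Return the class value with the highest score."""
    return max(scores, key=scores.get)

scores = class_scores(conditionals)
best = predict(scores)
print('Scores: {0}, predicted class: {1}'.format(scores, best))
```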

## Predict the Onset of Diabetes

The test problem we will use in this tutorial is the Pima Indians Diabetes problem.

The dataset comprises 768 observations of medical details for Pima Indian patients. The records describe measurements taken from the patients at a single point in time, such as their age, number of pregnancies and blood workup. All patients are women aged 21 or older. All attributes are numeric, and their units vary from attribute to attribute.

Each record has a class value that indicates whether the patient suffered an onset of diabetes within 5 years of when the measurements were taken (1) or not (0). This is a standard dataset that has been studied extensively in the machine learning literature. A good prediction accuracy is 70%-76%.

Below is a sample from the _pima-indians-diabetes.data.csv_ file to get a sense of the data we will be working with:
```
6,148,72,35,0,33.6,0.627,50,1
1,85,66,29,0,26.6,0.351,31,0
8,183,64,0,0,23.3,0.672,32,1
1,89,66,23,94,28.1,0.167,21,0
0,137,40,35,168,43.1,2.288,33,1
```

This tutorial is broken down into the following steps:
1. __Handle Data__: Load the data from the CSV file and split it into training and test datasets.
2. __Summarize Data__: Summarize the properties in the training dataset so that we can calculate probabilities and make predictions.
3. __Make a Prediction__: Use the summaries of the dataset to generate a single prediction.
4. __Make Predictions__: Generate predictions given a test dataset and a summarized training dataset.
5. __Evaluate Accuracy__: Evaluate the accuracy of predictions made for a test dataset as the percentage correct out of all predictions made.
6. __Tie It Together__: Use all of the code elements to present a complete and standalone implementation of the Naive Bayes algorithm.

### 1. Handle Data

The first thing we need to do is to read our data file. The data is in CSV format without a header line or any quotes. We also need to convert the attributes that were loaded as strings into numbers so that we can work with them. Below is the __read_csv()__ function for loading the Pima Indians dataset.

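```python
def read_csv(filename):
    """Load a headerless CSV file as a list of rows of floats."""
    with open(filename) as f:
        # Skip blank lines (e.g. a trailing newline) so float() never sees ''.
        dataset = [[float(x) for x in line.split(',')] for line in f if line.strip()]
    return dataset
```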

We can test this function by loading the Pima Indians dataset and printing the number of data instances that were loaded.

```python
filename = 'pima-indians-diabetes.data.csv'
dataset = read_csv(filename)
print('Loaded data file "{0}" with {1} rows'.format(filename, len(dataset)))
```

```
Loaded data file "pima-indians-diabetes.data.csv" with 768 rows
```
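
As an alternative, the same loading step can be sketched with the standard library's `csv` module, which handles the line splitting for us. The name `read_csv_rows` is hypothetical, and the example writes a small temporary file with two of the sample rows so that it runs without the real data file.

```python
import csv
import os
import tempfile

def read_csv_rows(filename):
    """Hypothetical variant of read_csv() built on csv.reader."""
    with open(filename, newline='') as f:
        # csv.reader yields an empty list for blank lines; skip those.
        return [[float(x) for x in row] for row in csv.reader(f) if row]

# Two sample rows separated by a blank line, written to a temporary file.
sample = '6,148,72,35,0,33.6,0.627,50,1\n\n1,85,66,29,0,26.6,0.351,31,0\n'
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False) as tmp:
    tmp.write(sample)
    path = tmp.name

try:
    rows = read_csv_rows(path)
finally:
    os.remove(path)

print('Loaded {0} rows with {1} columns each'.format(len(rows), len(rows[0])))
```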


Next we need to split the data into a training dataset that Naive Bayes can use to make predictions and a test dataset that we can use to evaluate the accuracy of the model. We need to split the dataset randomly into train and test datasets with a ratio of 67% train and 33% test (this is a common ratio for testing an algorithm on a dataset).

Below is the __train_test_split()__ function, which splits a given dataset into training and test sets according to a given split ratio.

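```python
import random

def train_test_split(dataset, split_ratio):
    """Randomly move a split_ratio share of the rows into train; the rest is test."""
    train_size = int(len(dataset) * split_ratio)
    train = []
    test = list(dataset)  # copy, so the caller's list is not modified
    while len(train) < train_size:
        # Pick a random remaining row and move it from test to train.
        index = random.randrange(len(test))
        train.append(test.pop(index))
    return train, test
```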

We can test this by defining a mock dataset with 5 instances, splitting it into training and test datasets, and printing them to see which data instances ended up where.

```python
dataset = [[1], [2], [3], [4], [5]]
split_ratio = 0.67
train, test = train_test_split(dataset, split_ratio)
print('Split {0} rows into train with {1} and test with {2}'.format(len(dataset), train, test))
```

```
Split 5 rows into train with [[4], [5], [3]] and test with [[1], [2]]
```
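
Because the split is random, the rows that land in each set change from run to run. If a repeatable split is wanted, one option is to drive the shuffle from a seeded random number generator. The sketch below is an assumption-laden variant, not the function above: `seeded_split` is a hypothetical helper that shuffles a copy of the data and slices it, rather than popping rows one at a time.

```python
import random

def seeded_split(dataset, split_ratio, seed):
    """Hypothetical helper: shuffle a copy with a private RNG, then slice."""
    rng = random.Random(seed)   # private generator, leaves the global RNG alone
    shuffled = list(dataset)    # copy, so the caller's list is not modified
    rng.shuffle(shuffled)
    train_size = int(len(shuffled) * split_ratio)
    return shuffled[:train_size], shuffled[train_size:]

dataset = [[1], [2], [3], [4], [5]]
first = seeded_split(dataset, 0.67, seed=1)
second = seeded_split(dataset, 0.67, seed=1)
print('Same split on both calls: {0}'.format(first == second))
```

Using the same seed gives the same train and test sets on every call, which makes accuracy figures comparable between runs.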


### 2. Summarize Data

The Naive Bayes model consists of a summary of the data in the training dataset. This summary is then used when making predictions.

The summary consists of the mean and the standard deviation of each attribute, computed separately for each class value.
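
To make the shape of this summary concrete, here is a minimal sketch on a toy dataset. It assumes the class value is the last item in each row, as in the sample data above, and uses the sample standard deviation from the standard library's `statistics` module; `summarize_by_class` is a hypothetical name, and the toy rows are invented for illustration.

```python
from statistics import mean, stdev

def summarize_by_class(dataset):
    """Return {class value: [(mean, stdev) for each attribute]}."""
    # Group the attribute values of each row under its class value.
    separated = {}
    for row in dataset:
        separated.setdefault(row[-1], []).append(row[:-1])
    summaries = {}
    for label, rows in separated.items():
        # zip(*rows) turns the rows into columns, one per attribute.
        # stdev() is the sample standard deviation and needs 2+ rows per class.
        summaries[label] = [(mean(col), stdev(col)) for col in zip(*rows)]
    return summaries

# Toy data: two attributes followed by the class value (assumed layout).
toy = [
    [1.0, 20.0, 0],
    [2.0, 21.0, 0],
    [3.0, 22.0, 0],
    [4.0, 30.0, 1],
    [6.0, 32.0, 1],
]
summaries = summarize_by_class(toy)
for label, stats in summaries.items():
    print('Class {0}: {1}'.format(label, stats))
```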