# Naive Bayes



## What is naive Bayes?

Naive Bayes is mostly used as a classifier. It is based on Bayes' theorem, combined with the assumption that the variables (features) are independent of each other given the class. That assumption is what makes it "naive".

It is a probabilistic algorithm, so its output is the probability of each possible outcome.
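In symbols, writing $y$ for the class and $x_1, \dots, x_n$ for the features (notation chosen here for illustration), Bayes' theorem together with the independence assumption gives:

$$
P(y \mid x_1, \dots, x_n)
= \frac{P(y)\, P(x_1, \dots, x_n \mid y)}{P(x_1, \dots, x_n)}
\;\propto\; P(y) \prod_{i=1}^{n} P(x_i \mid y)
$$

The predicted class is the $y$ that maximizes the right-hand side. The denominator can be dropped because it is the same for every class.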

![alt text](https://i.stack.imgur.com/QiTPe.png "Gaussian Distribution")


## 1 - Dependencies

In [2]:
# Matrix operations
import numpy as np
# Plotting
import matplotlib.pyplot as plt
# In case you want to split the dataset into train and test
from sklearn.model_selection import train_test_split
# We want to measure the accuracy of the model
from sklearn import metrics

## Implementation

### 2 - Data Setup

We will suppose that we have already read the data with our csv library, as explained in the README.md.

If you have another kind of data, you can use one of the options described in that file.

Let's suppose we have X and y, our dataset matrix built from words and its labels.

In [1]:
# `text` (the loaded dataset) and `rnd_number` (the random seed) are assumed to be defined beforehand
X_train, X_test, y_train, y_test = train_test_split(
    text.data, text.target, test_size=0.3, random_state=rnd_number
)  # 70% training and 30% test
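To make "a dataset matrix built from words" concrete, here is a minimal standard-library sketch that turns a few documents into a word-count matrix. The corpus, the labels and the helper name `to_counts` are hypothetical. In practice a ready-made vectorizer such as sklearn's `CountVectorizer` would normally do this step.

```python
from collections import Counter

# Hypothetical toy corpus with one label per document
docs = ["free offer now", "meeting at noon", "free free offer"]
labels = [1, 0, 1]

# Vocabulary: every distinct word, sorted so the column order is stable
vocabulary = sorted({word for doc in docs for word in doc.split()})

def to_counts(doc):
    """Return the word counts of one document, one entry per vocabulary word."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]

# Count matrix: one row per document, one column per vocabulary word
X = [to_counts(doc) for doc in docs]

print(vocabulary)
print(X)
```

Each row of `X` can then play the role of one sample, with `labels` as the target.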

### 3 - Naive Bayes
The naive Bayes functions are implemented in:

In [3]:
from sklearn.naive_bayes import MultinomialNB # For discrete counts such as word counts (multinomial distribution)
from sklearn.naive_bayes import GaussianNB # For continuous features assumed to follow a Gaussian distribution

### 4 - Model
With sklearn, training our model takes only a few lines.

In [5]:
# Create a Gaussian Classifier
model = GaussianNB()

# Train the model using the training sets
model.fit(X_train, y_train)

# Predict the response for test dataset
y_pred = model.predict(X_test)
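Since the data in this walkthrough is built from words, `MultinomialNB` may be a more natural fit than `GaussianNB`. The following is a minimal sketch on a made-up count matrix. The three words and all the counts are hypothetical, chosen only to keep the example small.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical word-count matrix: rows are documents, columns are the
# counts of the words ["free", "meeting", "offer"]
X = np.array([
    [3, 0, 2],
    [2, 0, 3],
    [0, 3, 0],
    [0, 2, 1],
])
y = np.array([1, 1, 0, 0])  # 1 = spam, 0 = not spam (made-up labels)

# alpha=1.0 is add-one (Laplace) smoothing
clf = MultinomialNB(alpha=1.0)
clf.fit(X, y)

new_doc = np.array([[2, 0, 1]])
pred = clf.predict(new_doc)          # predicted label
proba = clf.predict_proba(new_doc)   # one probability per class

print(pred, proba)
```

`predict_proba` returns the per-class probabilities, which is the probabilistic output described at the start of this notebook.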

### 5 - Predict
Just like the training part, testing is a short piece of code.

In [1]:
# Predict the response for test dataset
y_pred = model.predict(X_test)

# Model accuracy: the fraction of test samples whose predicted label matches y_test
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
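For reference, accuracy is simply the share of predictions that match the true labels. A minimal sketch with made-up labels (the function name `accuracy` is chosen here for illustration):

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)

# Hypothetical labels: two of the four predictions are correct
score = accuracy([1, 0, 1, 1], [1, 1, 1, 0])
print(score)
```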

### * - Laplace

The well-known zero-probability problem appears when the set of observed values is not continuous. In theory the distributions we are working with are continuous, but in practice we only have a finite amount of data. This leaves many gaps, that is, values whose estimated probability is 0. Because naive Bayes multiplies the per-feature probabilities, a single 0 wipes out the whole product for that class.

The solution to this problem is to add some new "invented" data that fills those gaps with a tiny probability that doesn't affect the whole model.
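Written as a formula, one common form of this additive smoothing is shown below. Here $\mathrm{count}(w, c)$ is how often word $w$ appears in class $c$, $N_c$ is the total number of words in class $c$, $|V|$ is the vocabulary size and $\lambda$ is the smoothing strength (the symbols are chosen here for illustration):

$$
\hat{P}(w \mid c) = \frac{\mathrm{count}(w, c) + \lambda}{N_c + \lambda\,|V|}
$$

With $\lambda = 1$ this is add-one smoothing: a word never seen in class $c$ gets probability $1 / (N_c + |V|)$ instead of 0.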

We will assume that we are trying to solve a natural language problem, and that we have already defined the classes for each word, the vocabulary, and so on.

In [3]:
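def laplace_smoothing(self, X_train, y_train, lam=1.0):
    '''
        Apply Laplace smoothing to the per-class word probability table.
        lam = 1 (the default) is add-one smoothing; lam is normally <= 1.
        `lambda` is a reserved word in Python, so the parameter is named lam.
        Assumes self.count_word_in_classes[label] is a dict of word counts.
    '''
    vocabulary = self.vocabulary(X_train)  # computed once, not on every iteration
    for label in set(y_train):
        # Total number of words observed in this class
        words_in_class = sum(self.count_word_in_classes[label].values())
        for word in vocabulary:
            count = self.count_word_in_classes[label].get(word, 0)
            self.similar[label][word] = np.log((count + lam) / (words_in_class + lam * len(vocabulary)))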
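The method above depends on a surrounding class that is not shown here. As a self-contained illustration of the same idea, the sketch below smooths the word probabilities of a single class using only the standard library. The function name, the word list and the vocabulary are hypothetical, and every word of the class is assumed to be in the vocabulary.

```python
import math
from collections import Counter

def smoothed_log_likelihoods(class_words, vocabulary, lam=1.0):
    """Return log P(word | class) for every vocabulary word, with additive smoothing."""
    counts = Counter(class_words)
    denominator = len(class_words) + lam * len(vocabulary)
    return {word: math.log((counts[word] + lam) / denominator) for word in vocabulary}

# Hypothetical data: "meeting" never appears in the spam class
vocabulary = ["free", "meeting", "offer"]
spam_words = ["free", "offer", "free"]

log_probs = smoothed_log_likelihoods(spam_words, vocabulary)
for word, log_p in log_probs.items():
    print(word, math.exp(log_p))
```

Without smoothing, "meeting" would get probability 0 in this class. With `lam=1.0` it gets a small non-zero value, and the probabilities over the vocabulary still sum to 1.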

## From Scratch
Code from
https://machinelearningmastery.com/naive-bayes-classifier-scratch-python/

In [None]:
import csv
import random
import math
 
def loadCsv(filename):
	# Open in text mode (Python 3) and convert every field of every row to float
	with open(filename, newline="") as f:
		dataset = [[float(x) for x in row] for row in csv.reader(f)]
	return dataset
 
def splitDataset(dataset, splitRatio):
	trainSize = int(len(dataset) * splitRatio)
	trainSet = []
	copy = list(dataset)
	while len(trainSet) < trainSize:
		index = random.randrange(len(copy))
		trainSet.append(copy.pop(index))
	return [trainSet, copy]
 
def separateByClass(dataset):
	separated = {}
	for i in range(len(dataset)):
		vector = dataset[i]
		if (vector[-1] not in separated):
			separated[vector[-1]] = []
		separated[vector[-1]].append(vector)
	return separated
 
def mean(numbers):
	return sum(numbers)/float(len(numbers))
 
def stdev(numbers):
	avg = mean(numbers)
	variance = sum([pow(x-avg,2) for x in numbers])/float(len(numbers)-1)
	return math.sqrt(variance)
 
def summarize(dataset):
	summaries = [(mean(attribute), stdev(attribute)) for attribute in zip(*dataset)]
	del summaries[-1]
	return summaries
 
def summarizeByClass(dataset):
	separated = separateByClass(dataset)
	summaries = {}
	for classValue, instances in separated.items():
		summaries[classValue] = summarize(instances)
	return summaries
 
def calculateProbability(x, mean, stdev):
	exponent = math.exp(-(math.pow(x-mean,2)/(2*math.pow(stdev,2))))
	return (1 / (math.sqrt(2*math.pi) * stdev)) * exponent
 
def calculateClassProbabilities(summaries, inputVector):
	probabilities = {}
	for classValue, classSummaries in summaries.items():
		probabilities[classValue] = 1
		for i in range(len(classSummaries)):
			mean, stdev = classSummaries[i]
			x = inputVector[i]
			probabilities[classValue] *= calculateProbability(x, mean, stdev)
	return probabilities
			
def predict(summaries, inputVector):
	probabilities = calculateClassProbabilities(summaries, inputVector)
	bestLabel, bestProb = None, -1
	for classValue, probability in probabilities.items():
		if bestLabel is None or probability > bestProb:
			bestProb = probability
			bestLabel = classValue
	return bestLabel
 
def getPredictions(summaries, testSet):
	predictions = []
	for i in range(len(testSet)):
		result = predict(summaries, testSet[i])
		predictions.append(result)
	return predictions
 
def getAccuracy(testSet, predictions):
	correct = 0
	for i in range(len(testSet)):
		if testSet[i][-1] == predictions[i]:
			correct += 1
	return (correct/float(len(testSet))) * 100.0
 
def main():
	filename = 'pima-indians-diabetes.data.csv'
	splitRatio = 0.67
	dataset = loadCsv(filename)
	trainingSet, testSet = splitDataset(dataset, splitRatio)
	print('Split {0} rows into train={1} and test={2} rows'.format(len(dataset), len(trainingSet), len(testSet)))
	# prepare model
	summaries = summarizeByClass(trainingSet)
	# test model
	predictions = getPredictions(summaries, testSet)
	accuracy = getAccuracy(testSet, predictions)
	print('Accuracy: {0}%'.format(accuracy))
 
main()
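The script above needs the CSV file to run. As a quick way to try the same steps without it, here is a compact sketch on made-up in-memory data. It is not the original author's code: the function names and the toy rows are hypothetical, and it sums log densities instead of multiplying raw densities, which is a common way to avoid numerical underflow. Like the script above, it ignores class priors.

```python
import math
from collections import defaultdict
from statistics import mean, stdev

def fit(rows):
    """Return {label: [(mean, stdev), ...]} from rows whose last item is the label."""
    by_class = defaultdict(list)
    for *features, label in rows:
        by_class[label].append(features)
    return {
        label: [(mean(column), stdev(column)) for column in zip(*samples)]
        for label, samples in by_class.items()
    }

def log_gaussian(x, mu, sigma):
    """Log of the normal density with mean mu and standard deviation sigma."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

def predict(summaries, features):
    """Pick the class with the largest sum of per-feature log densities."""
    scores = {
        label: sum(log_gaussian(x, mu, sigma) for x, (mu, sigma) in zip(features, stats))
        for label, stats in summaries.items()
    }
    return max(scores, key=scores.get)

# Hypothetical toy data: [feature_1, feature_2, label]
rows = [
    [1.0, 20.0, 0], [2.0, 21.0, 0], [3.0, 22.0, 0],
    [6.0, 30.0, 1], [7.0, 31.0, 1], [8.0, 32.0, 1],
]

summaries = fit(rows)
print(predict(summaries, [1.5, 20.5]))
print(predict(summaries, [7.5, 31.5]))
```

`statistics.stdev` is the sample standard deviation (dividing by n - 1), which matches the `stdev` function in the script above.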