<a href="https://colab.research.google.com/github/HenriARM/ML/blob/master/naive-bayes/naive-bayes-iris.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook classifies the [Iris flower data set](https://en.wikipedia.org/wiki/Iris_flower_data_set) with a [Gaussian naive Bayes classifier](https://en.wikipedia.org/wiki/Naive_Bayes_classifier).

# **Problem**
Given an object to be classified, represented by a vector $x = (x_1, \dots, x_n)$ of $n$ features, compute the probability $p(C_k \mid x)$ for each of the $K$ possible classes $C_k$.

# **Bayes probability model**

[Bayes' theorem](https://en.wikipedia.org/wiki/Bayes%27_theorem): $p(C_k \mid x) = \frac{p(C_k)\, p(x \mid C_k)}{p(x)}$

The denominator $p(x)$ does not depend on $C_k$, so it is the same constant for every class and can be ignored when comparing classes.
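
To see why the denominator can be dropped, here is a minimal sketch with made-up priors and likelihoods (the numbers are illustrative assumptions, not taken from the Iris data). Dividing by $p(x)$ rescales every class by the same factor, so the most probable class stays the same.

```python
# Hypothetical priors p(C_k) and likelihoods p(x | C_k) for three classes.
priors = [0.5, 0.3, 0.2]
likelihoods = [0.10, 0.40, 0.25]

# Numerator of Bayes' theorem for each class: p(C_k) * p(x | C_k)
numerators = [p * l for p, l in zip(priors, likelihoods)]

# The denominator p(x) is the sum of the numerators over all classes.
evidence = sum(numerators)
posteriors = [num / evidence for num in numerators]

# Index of the best class with and without the normalization.
best_unnormalized = max(range(len(numerators)), key=numerators.__getitem__)
best_normalized = max(range(len(posteriors)), key=posteriors.__getitem__)

print(posteriors)
print(best_unnormalized == best_normalized)
```

The normalization is only needed when actual posterior probabilities are wanted, not for picking the winning class.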

The numerator is equivalent to the joint probability, which the chain rule expands as:
$p(C_k, x_1, \dots, x_n) = p(x_1 \mid x_2, \dots, x_n, C_k)\, p(x_2 \mid x_3, \dots, x_n, C_k) \dots p(x_{n-1} \mid x_n, C_k)\, p(x_n \mid C_k)\, p(C_k)$

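We assume that the features in $x$ are conditionally independent given the class:
$p(x_i \mid x_{i+1}, \dots, x_n, C_k) = p(x_i \mid C_k)$

Under this assumption the joint probability factorizes as $p(C_k, x_1, \dots, x_n) = p(C_k) \prod_{i=1}^{n} p(x_i \mid C_k)$.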



# **Naive Bayes classifier**
The naive Bayes classifier combines the Bayes probability model with a decision rule: pick the class with the highest posterior probability.

$y = \underset{k \in \{1, \dots, K\}}{\operatorname{argmax}} \left[ p(C_k) \prod_{i=1}^{n} p(x_i \mid C_k) \right]$
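
The decision rule multiplies many small numbers, which can underflow when $n$ is large. A common workaround, sketched here as an assumption and not what the notebook code does, is to compare sums of logarithms instead. The logarithm is monotonic, so the argmax is unchanged. The function name `argmax_log_posterior` and the input numbers are hypothetical.

```python
from math import log

def argmax_log_posterior(priors, feature_likelihoods):
    """Return the index k that maximizes log p(C_k) + sum_i log p(x_i | C_k).

    priors: list with p(C_k) for every class
    feature_likelihoods: for every class, a list with p(x_i | C_k) per feature
    """
    scores = []
    for prior, likelihoods in zip(priors, feature_likelihoods):
        scores.append(log(prior) + sum(log(l) for l in likelihoods))
    return max(range(len(scores)), key=scores.__getitem__)

# Illustrative numbers only: two classes, three features.
priors = [0.5, 0.5]
feature_likelihoods = [
    [0.9, 0.8, 0.7],  # class 0
    [0.2, 0.3, 0.1],  # class 1
]
print(argmax_log_posterior(priors, feature_likelihoods))  # prints 0
```

This sketch raises `ValueError` if any prior or likelihood is exactly zero, since `log(0)` is undefined.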

# **Gaussian naive Bayes**

Gaussian naive Bayes handles continuous features by assuming that, within each class, every feature follows a [Gaussian](https://en.wikipedia.org/wiki/Normal_distribution) (normal) distribution:
$p(x \mid C_k) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}}$

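Here $\mu$ and $\sigma$ are estimated from the $N$ training values $x^{(1)}, \dots, x^{(N)}$ of the feature within class $C_k$:

$\mu = \frac{1}{N} \sum_{j=1}^{N} x^{(j)}$ (mean)

$\sigma = \sqrt{\frac{1}{N} \sum_{j=1}^{N} (x^{(j)} - \mu)^2}$ (standard deviation)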
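
As a quick check of these formulas, the sketch below estimates $\mu$ and $\sigma$ from a handful of made-up measurements using only the standard library, then evaluates the density at the mean. `statistics.pstdev` is the population standard deviation (it divides by the number of values), which matches the formula above. The helper name `gaussian_pdf` and the sample values are assumptions for illustration.

```python
from math import sqrt, pi, exp
from statistics import fmean, pstdev

def gaussian_pdf(x, mu, sigma):
    """Density of the normal distribution with mean mu and stdev sigma at x."""
    return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * sqrt(2 * pi))

# Made-up measurements (cm) of one feature in one class; not Iris data.
values = [0.2, 0.2, 0.3, 0.1, 0.2]

mu = fmean(values)          # mean
sigma = pstdev(values, mu)  # population standard deviation

# The density peaks at x = mu, where it equals 1 / (sigma * sqrt(2 * pi)).
peak = gaussian_pdf(mu, mu, sigma)
print(mu, sigma, peak)
```

A density is not a probability: when $\sigma$ is small, as it is here, the value at the peak is greater than 1.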



In [0]:
from math import sqrt, pi, exp

class Math:
  """Static helper methods for basic statistics."""

  @staticmethod
  def mean(x):
    """Arithmetic mean.
    Args:
      x: sequence of numbers, shape (n)
    Returns:
      float
    """
    return sum(x)/len(x)

  @staticmethod
  def stdev(x, mean):
    """Population standard deviation (divides by n, not n - 1).
    Args:
      x: sequence of numbers, shape (n)
      mean: mean of x
    Returns:
      float
    """
    return sqrt( (1/len(x)) * sum([(x_i - mean)**2 for x_i in x]) )

  @staticmethod
  def normal_distribution(x, mean, stdev):
    """Gaussian (normal) probability density at x.
    Args:
      x: number at which the density is evaluated
      mean: mean of the distribution
      stdev: standard deviation of the distribution
    Returns:
      float
    """
    exponent = exp( -1/2 * ( ( ( x - mean ) / stdev )**2 ) )
    return ( 1 / ( stdev * sqrt(2 * pi) ) ) * exponent


In [0]:
import numpy as np

class GaussianNB:
  """Gaussian naive Bayes classifier."""

  def __init__(self):
    self.class_amount = 0
    self.feature_amount = 0
    self.data_amount = 0
    # dictionary of training vectors grouped by class
    self.separated = None
    # dictionary that stores (mean, stdev) of every feature for each class;
    # these parameterize the normal distribution p(x_i | C_k),
    # where x_i is a vector component and C_k is a class
    self.features = None

  def __separate_data_by_class(self, x, y):
    """Separate training vectors by class.
    Args:
      x: training vectors, shape (n_samples, n_features)
      y: class label of each vector, shape (n_samples,)
    Returns:
      dictionary with a list of vectors for each class
    """
    d = dict()
    for i in range(len(x)):
      vector = x[i]
      vclass = y[i]
      # create a new list for the class if it has not been seen before
      if vclass not in d:
        d[vclass] = list()
      # append the vector to its class
      d[vclass].append(vector)
    return d

  def fit(self, x, y):
    """Fit Gaussian naive Bayes to the training data.
    Args:
      x: NumPy array of training vectors, shape (n_samples, n_features)
      y: class label of each vector, shape (n_samples,);
         labels are assumed to be the integers 0 .. K-1
    """
    self.data_amount, self.feature_amount = x.shape
    self.separated = self.__separate_data_by_class(x, y)
    self.class_amount = len(self.separated)
    # get mean and stdev for each feature in each class
    self.features = dict()
    for k in range(self.class_amount):
      self.features[k] = list()
      for i in range(self.feature_amount):
        column = np.array(self.separated[k])[:, i]
        m = Math.mean(column)
        sd = Math.stdev(column, m)
        self.features[k].append((m, sd))
    print(type(self.separated))

  def predict(self, x_test):
    """Predict the class of a single vector.
    Args:
      x_test: vector to classify, shape (n_features,)
    Returns:
      index of the most probable class
    """
    probabilities = dict()
    for k in range(self.class_amount):
      # prior p(C_k): number of training vectors in class C_k divided by the number of all training vectors
      p_c_k = len(self.separated[k]) / self.data_amount
      probabilities[k] = p_c_k
      for i in range(self.feature_amount):
        m, sd = self.features[k][i]
        likelihood = Math.normal_distribution(x_test[i], m, sd)
        probabilities[k] *= likelihood

    # these are unnormalized scores p(C_k) * prod_i p(x_i | C_k), not true probabilities
    print("Probabilities:", probabilities)
    # find the class with the highest score
    arr = np.array(list(probabilities.values()))
    y = arr.argmax(axis=0)
    return y

In [22]:
import numpy as np
# The Iris dataset ships with scikit-learn
from sklearn.datasets import load_iris
 
x_train, y_train = load_iris(return_X_y = True)
classes = load_iris().target_names

classifier = GaussianNB()
classifier.fit(x_train, y_train)

# classify a single vector; note that it is taken from the training data
x_test = np.array(x_train[49])
class_index = classifier.predict(x_test)
print("Flower types: ", classes)
flower_name = classes[class_index]
print("The flower with features: %s is %s" %(x_test, flower_name))

<class 'dict'>
Probabilities: {0: 2.8835592775381644, 1: 1.0328680590068721e-17, 2: 3.2212082936497125e-25}
Flower types:  ['setosa' 'versicolor' 'virginica']
The flower with features: [5.  3.3 1.4 0.2] is setosa
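
The values printed under "Probabilities" are the unnormalized products $p(C_k)\prod_{i} p(x_i \mid C_k)$, which is why the first one is greater than 1. As a small sketch, the scores copied from the output above can be divided by their sum to obtain posterior probabilities. The variable names are illustrative.

```python
# Unnormalized scores copied from the notebook output above.
scores = {0: 2.8835592775381644, 1: 1.0328680590068721e-17, 2: 3.2212082936497125e-25}
classes = ['setosa', 'versicolor', 'virginica']

# Dividing by the sum plays the role of the denominator p(x) in Bayes' theorem.
total = sum(scores.values())
posteriors = {k: s / total for k, s in scores.items()}

best = max(posteriors, key=posteriors.get)
print(posteriors)
print(classes[best])
```

The winning class is the same as before the normalization. Only the scale of the numbers changes.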
