 # Naive Bayes Classifier 

Bayes classifiers aim to determine the most probable class for a data point based on its features, using Bayes' theorem. They calculate the probability of each class given the observed features, considering prior knowledge and the likelihood of those features. The class with the highest probability is then assigned as the prediction.  This approach allows for probabilistic predictions, incorporating uncertainty and prior beliefs into the classification process.

$ P(A|B) = \frac{P(B|A)P(A)}{P(B)} $
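
As a quick numeric illustration of the theorem, here is a minimal sketch with made-up numbers. The spam-filter setting and all three input probabilities are assumptions chosen for the example, not values from any dataset.

```python
# Hypothetical numbers, chosen only to illustrate Bayes' theorem.
p_spam = 0.2               # P(A): prior probability that an email is spam
p_word_given_spam = 0.6    # P(B|A): probability the word appears in a spam email
p_word_given_ham = 0.05    # P(B|not A): probability the word appears in a non-spam email

# P(B): total probability of seeing the word (law of total probability)
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

# P(A|B): posterior probability of spam given that the word was seen
p_spam_given_word = p_word_given_spam * p_spam / p_word

print(round(p_spam_given_word, 3))  # 0.75
```

Even though only 20% of emails are spam in this toy setup, observing the word raises the probability of spam to 75%, because the word is far more likely under the spam class.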

Gaussian Naive Bayes assumes that continuous features within each class follow a normal distribution and uses the normal probability density function (PDF) to calculate likelihoods. It estimates the mean and variance of each feature per class from the training data. It then applies the Naive Bayes assumption that features are conditionally independent given the class, so the likelihood of a data point factorises into a product over its features. In practice, log probabilities are often used for numerical stability.

$ P(X | C_k) = P(x_1 | C_k) \cdot P(x_2 | C_k) \cdot ... \cdot P(x_n | C_k) = \prod_{i=1}^{n} P(x_i | C_k) $

Here $X$ is the feature vector of a data instance, $C_k$ is the class, and $x_i$ is the $i$-th of the $n$ features within that instance.

To prevent numerical underflow, especially when dealing with many small probabilities, we convert the Naive Bayes equation into logarithms and sums. Multiplying numerous tiny probabilities can result in values too small for computers to represent accurately. By taking the logarithm, products become sums, which are computationally more stable and efficient. This conversion maintains the relative ordering of probabilities, ensuring that the class with the highest probability remains the highest after the logarithmic transformation.

$ \log P(C_k | X) = \log(P(C_k)) + \sum_{i=1}^{n} \log(P(x_i | C_k)) - \log(P(X)) $
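
The underflow problem is easy to reproduce. The sketch below uses plain Python and 2000 hypothetical per-feature likelihoods of `1e-3` each (an assumption made purely for illustration) to compare the direct product with the sum of logs.

```python
import math

# 2000 hypothetical per-feature likelihoods, each equal to 1e-3
likelihoods = [1e-3] * 2000

# Direct product: the true value is 1e-6000, far below the smallest
# positive double, so the running product collapses to exactly 0.0
product = 1.0
for p in likelihoods:
    product *= p

# Sum of logs: stays a finite number that can still be compared across classes
log_sum = sum(math.log(p) for p in likelihoods)

print(product)  # 0.0
print(log_sum)
```

Once the product has underflowed to zero for every class, the classes can no longer be ranked, whereas the log scores remain distinct.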

# Algorithm Steps

The goal is to select the class with the highest posterior probability: $ \arg\max_{k} P(C_k|X) $

That is to say, having observed a data point $X$, which class is it most likely to belong to?
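
A small numeric sketch of this arg max, using hypothetical priors and likelihoods for three classes (the numbers are assumptions for illustration). It also shows that dividing by the evidence $P(X)$ rescales every class by the same factor, so the winning class does not change.

```python
# Hypothetical priors P(C_k) and likelihoods P(X|C_k) for three classes
priors = [0.5, 0.3, 0.2]
likelihoods = [0.02, 0.10, 0.05]

# Unnormalised scores P(X|C_k) * P(C_k)
scores = [p * l for p, l in zip(priors, likelihoods)]

# Exact posteriors: divide every score by the evidence P(X) = sum of the scores
evidence = sum(scores)
posteriors = [s / evidence for s in scores]

# Index of the best class under each scoring
best_by_score = max(range(len(scores)), key=lambda k: scores[k])
best_by_posterior = max(range(len(posteriors)), key=lambda k: posteriors[k])

print(best_by_score, best_by_posterior)  # 1 1
```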

1. **Get rid of the denominator of Bayes' theorem, $ P(X) $**. <br>
   This is the probability of observing the data point $X$ we are focusing on. It is the same regardless of which class we are evaluating, so it is a constant and is not needed to identify the highest score, i.e. the most likely class. Note that removing this term means we are calculating a quantity proportional to the posterior, not the exact probability, so the output is only useful for classification itself, not for probabilistic modelling. Dropping it also makes the calculation quicker and easier.

2. **Calculate Prior** <br>
    Prior = $P(y) = P(C_k)$ = frequency of each class in the whole dataset <br>
    In code: `class_subset.shape[0] / data.shape[0]`

3. **Class Conditionals** <br>
    Class-conditional probability of each feature of a data instance = $P(x_i|y)$, modelled using a probability density function (PDF) <br>

    ---

   
    **When features are continuous, we often assume they follow a normal (Gaussian) distribution within each class.** <br>
    The probability density function (PDF) for a normal distribution is:<br>
    $$P(x_i | C_k) = \mathcal{N}(x_i \mid \mu_{ik}, \sigma_{ik}^2) = \frac{1}{\sqrt{2\pi\sigma_{ik}^2}} e^{-\frac{(x_i-\mu_{ik})^2}{2\sigma_{ik}^2}}$$

   
    To use the normal distribution, you need to estimate $\mu$ and $\sigma^2$ for each feature $i$ and each class $C_k$ from the training data, where $N_{ck}$ is the number of training instances in class $C_k$:

    $$\mu_{ik} = \frac{1}{N_{ck}} \sum_{x_i \in C_k} x_i$$

    $$\sigma_{ik}^2 = \frac{1}{N_{ck} - 1} \sum_{x_i \in C_k} (x_i - \mu_{ik})^2$$

    Plug the observed value of $x_i$ into the normal distribution PDF using the estimated $\mu_{ik}$ and $\sigma^2_{ik}$.

    Once each feature's conditional probability has been calculated, you multiply them all together:

    $$P(X | C_k) = \prod_{i=1}^{n} P(x_i | C_k) = P(x_1 | C_k) \cdot P(x_2 | C_k) \cdot ... \cdot P(x_n | C_k)$$

    ---
    **For categorical features, you calculate the probability by counting the frequency of each category within each class.**

   $$P(x_i = v | C_k) = \frac{\text{Count of feature } x_i = v \text{ in class } C_k}{\text{Count of data points in class } C_k}$$

   Where $v$ is a specific category of feature $x_i$.


   To avoid probabilities of zero when a category is not observed in a class, Laplace smoothing is often used:

    $$P(x_i = v | C_k) = \frac{\text{Count of feature } x_i = v \text{ in class } C_k + 1}{\text{Count of data points in class } C_k + \text{Number of categories in } x_i} $$

   ---

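4. **Convert equation to logarithms and sums** <br>
    $$\log P(C_k | X) = \log(P(C_k)) + \sum_{i=1}^{n} \log(P(x_i | C_k)) - \log(P(X))$$
    Because $\log(P(X))$ is the same for every class (step 1), it can be dropped when comparing classes.


5. **Execution Steps** <br>
   Training: Calculate means, variances and priors <br>

   Prediction: Calculate the posterior for each class and select the class with the highest value <br>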
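
The categorical case with Laplace smoothing from step 3 can be sketched in a few lines of plain Python. The toy data and the helper name `smoothed_likelihood` are hypothetical, introduced only for this illustration.

```python
from collections import Counter

# Hypothetical toy data: one categorical feature ("colour") and a class label.
# "blue" never occurs in class "a", so its unsmoothed likelihood would be zero.
colours = ["red", "red", "green", "red", "green", "blue"]
labels = ["a", "a", "a", "b", "b", "b"]

categories = sorted(set(colours))  # every value the feature can take

def smoothed_likelihood(value, class_label):
    """P(x = value | class) with add-one (Laplace) smoothing."""
    in_class = [c for c, y in zip(colours, labels) if y == class_label]
    count = Counter(in_class)[value]  # a missing key counts as 0
    return (count + 1) / (len(in_class) + len(categories))

print(smoothed_likelihood("red", "a"))   # (2 + 1) / (3 + 3) = 0.5
print(smoothed_likelihood("blue", "a"))  # (0 + 1) / (3 + 3), no longer zero
```

Because one pseudo-count is added per category, the smoothed likelihoods within a class still sum to 1.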

# Loading The Dataset

In [1]:
from sklearn.datasets import load_iris
import numpy as np

# Load the Iris dataset
iris = load_iris()

# Extract the data (features) into X
X = iris.data

# Extract the labels (target) into Y
Y = iris.target

# Optional: Verify the shapes and data types
print("Shape of X:", X.shape)
print("Shape of Y:", Y.shape)
print("Data type of X:", X.dtype)
print("Data type of Y:", Y.dtype)

# Optional: Print the first few rows of X and Y
print("\nFirst 5 rows of X:")
print(X[:5])
print("\nFirst 5 values of Y:")
print(Y[:5])

Shape of X: (150, 4)
Shape of Y: (150,)
Data type of X: float64
Data type of Y: int64

First 5 rows of X:
[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]

First 5 values of Y:
[0 0 0 0 0]


# Naive Bayes Class

In [49]:
import numpy as np

class NaiveBayes:

    def fit(self, data, labels):
        print("x")
        # identify the classes and features
        print(data.shape)

        n_instances, n_features = data.shape
        self._classes = np.unique(labels)
        n_classes = len(self._classes)

        print(n_instances, n_features, self._classes, n_classes)

        # allocate mean, variance and prior arrays: one row per class, one column per feature
        self._mean = np.zeros((n_classes, n_features), dtype=np.float64)
        self._var = np.zeros((n_classes, n_features), dtype=np.float64)
        self._prior = np.zeros((n_classes, 1), dtype=np.float64)

        for idx, class_label in enumerate(self._classes):
            print(f"Processing Class {idx}")
            class_subset = data[labels == class_label]

            self._mean[idx, :] = class_subset.mean(axis=0)
            # note: ndarray.var defaults to ddof=0 (divides by N);
            # pass ddof=1 for the N - 1 sample variance
            self._var[idx, :] = class_subset.var(axis=0)
            self._prior[idx, :] = class_subset.shape[0] / data.shape[0]

            # show the parameter tables as they fill up, one class at a time
            print(self._mean)
            print(self._var)
            print(self._prior)

    def predict(self, test_set):
        # classify every row of the test set
        y_pred = [self._predict(x) for x in test_set]
        return np.array(y_pred)

    def _predict(self, instance):
        # per class, calculate the log posterior of C_k given the instance
        # (up to the constant log P(X), which is the same for every class)
        log_posteriors = []
        for idx in range(len(self._classes)):
            # log prior of the class
            log_prior = np.log(self._prior[idx, 0])
            # sum of log conditionals replaces the product of conditionals
            conditionals = self.pdf(instance, self._mean[idx], self._var[idx])
            log_posteriors.append(log_prior + np.sum(np.log(conditionals)))

        # select the class with the highest log posterior
        return self._classes[np.argmax(log_posteriors)]

    def pdf(self, x, mean, var):
        # x holds the feature values of a data instance;
        # mean and var are the per-feature parameters of one class
        numerator = (x - mean) ** 2
        denominator = 2 * var

        return np.exp(-numerator / denominator) / np.sqrt(2 * np.pi * var)
        

# Execution

In [51]:
nb = NaiveBayes()

nb.fit(X, Y)

predictions = nb.predict(X)

# Not finished
# https://www.youtube.com/watch?v=TLInuAorxqE

x
(150, 4)
150 4 [0 1 2] 3
Processing Class 0
[[5.006 3.428 1.462 0.246]
 [0.    0.    0.    0.   ]
 [0.    0.    0.    0.   ]]
[[0.121764 0.140816 0.029556 0.010884]
 [0.       0.       0.       0.      ]
 [0.       0.       0.       0.      ]]
[[0.33333333]
 [0.        ]
 [0.        ]]
Processing Class 1
[[5.006 3.428 1.462 0.246]
 [5.936 2.77  4.26  1.326]
 [0.    0.    0.    0.   ]]
[[0.121764 0.140816 0.029556 0.010884]
 [0.261104 0.0965   0.2164   0.038324]
 [0.       0.       0.       0.      ]]
[[0.33333333]
 [0.33333333]
 [0.        ]]
Processing Class 2
[[5.006 3.428 1.462 0.246]
 [5.936 2.77  4.26  1.326]
 [6.588 2.974 5.552 2.026]]
[[0.121764 0.140816 0.029556 0.010884]
 [0.261104 0.0965   0.2164   0.038324]
 [0.396256 0.101924 0.298496 0.073924]]
[[0.33333333]
 [0.33333333]
 [0.33333333]]