# Naive Bayes Classifier

The Naive Bayes classifier is a popular probabilistic machine learning algorithm used for classification tasks, particularly in text classification and spam email detection. It's based on Bayes' theorem and assumes that features (independent variables) are conditionally independent of each other given the class label. Despite its "naive" assumption of independence, it often performs surprisingly well and is computationally efficient. The classifier calculates the probability of an input belonging to each class and selects the class with the highest probability as the prediction.
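
The decision rule described above can be written compactly. Here $y$ denotes a class label and $x_1, \dots, x_n$ the feature values of one input; these symbols are notation introduced for this sketch, not taken from the original text.

```latex
P(y \mid x_1, \dots, x_n) \;\propto\; P(y) \prod_{i=1}^{n} P(x_i \mid y),
\qquad
\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y)
```

The product over features is where the conditional-independence assumption enters. The denominator $P(x_1, \dots, x_n)$ is the same for every class, so it can be dropped when only the highest-scoring class is needed.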

![Naive Bayes Classifier](https://miro.medium.com/v2/resize:fit:600/1*aFhOj7TdBIZir4keHMgHOw.png)

In [1]:
# Importing necessary libraries for the code
import numpy as np
import pandas as pd
import math as m
from sklearn.naive_bayes import GaussianNB

## Dataset used for testing the model

The dataset used for testing the model is the standard Iris dataset.  
It can be found [here](https://archive.ics.uci.edu/dataset/53/iris) or in my [GitHub repository](https://github.com/Aditya-0911/Naive-Bayes-Classifier).

The Iris dataset is one of the most widely used datasets in machine learning and statistics. It was introduced by the British biologist and statistician Ronald A. Fisher in 1936 and has since served as a standard benchmark for classification and clustering tasks, for students and researchers alike.

**Key Characteristics of the Iris Dataset:**

- **Dataset Type:** The Iris dataset is classified as a multivariate dataset, signifying that it contains measurements of multiple features or attributes for each individual sample.

- **Number of Classes:** The dataset encompasses three distinct classes, each representing a species of iris flowers: Setosa, Versicolor, and Virginica.

- **Features:** For each sample of an iris flower, the dataset includes four essential features: sepal length, sepal width, petal length, and petal width. These measurements are recorded in centimeters.

- **Sample Size:** The Iris dataset comprises a total of 150 samples, with an equal distribution of 50 samples for each of the three classes.
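
These characteristics can be checked quickly. The sketch below assumes scikit-learn's bundled copy of the data (`sklearn.datasets.load_iris`) rather than the CSV file used later in this notebook; the shape and class counts are expected to match, but the column names differ.

```python
import numpy as np
from sklearn.datasets import load_iris

# Load scikit-learn's bundled copy of the Iris data (an assumption of this
# sketch; the notebook itself reads the data from a CSV file).
iris = load_iris()

n_samples, n_features = iris.data.shape
class_names = list(iris.target_names)
class_counts = np.bincount(iris.target)  # number of samples per class

print("Samples:", n_samples)
print("Features:", iris.feature_names)
print("Classes:", class_names)
print("Samples per class:", class_counts.tolist())
```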

Because it is small and its classes are well defined, the Iris dataset is a convenient starting point for practicing data analysis, preprocessing, and classification before moving on to more complex datasets.

## Code Explanation

The Python code below implements a Gaussian Naive Bayes classifier for a user-supplied CSV dataset. Here is how it works:

1. The user inputs the filename of a CSV dataset and specifies the target variable (class label).

2. The script extracts the target variable values and lets the user select a subset of columns (factors) by index to use as features, either as an inclusive range such as `0:3` or as a comma-separated list such as `0,1,3`.

3. It splits the dataset into training and testing sets (80% training and 20% testing).

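4. The `devdas` function is called once for each split. For the DataFrame it receives, it estimates the prior probability of each class and the per-class mean and standard deviation of every feature, and it uses the normal (Gaussian) density as the likelihood of each feature value.

5. For every row, it multiplies the class prior by the feature likelihoods to obtain an unnormalized posterior score per class and predicts the class with the highest score.

6. The function returns the fraction of rows predicted correctly, and the script prints this accuracy for the training and testing splits. Because each split is fitted and scored on its own rows, the testing figure is not a held-out estimate.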
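
The core computation in steps 4 and 5 can be sketched in a few lines of plain Python. This is a minimal illustration, not the notebook's implementation: the function names (`gaussian_pdf`, `fit`, `predict`) and the tiny two-feature dataset are hypothetical, and unlike the script it estimates parameters on one set of rows and predicts on others.

```python
import math
from collections import defaultdict
from statistics import mean, stdev

def gaussian_pdf(x, mu, sigma):
    """Normal density, used as the likelihood of one feature value."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (math.sqrt(2 * math.pi) * sigma)

def fit(rows, labels):
    """Estimate the prior and per-feature (mean, std) for each class."""
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    model = {}
    for label, members in by_class.items():
        prior = len(members) / len(rows)
        # zip(*members) regroups the rows into one tuple per feature column
        stats = [(mean(col), stdev(col)) for col in zip(*members)]
        model[label] = (prior, stats)
    return model

def predict(model, row):
    """Return the class with the largest prior * product of likelihoods."""
    scores = {}
    for label, (prior, stats) in model.items():
        score = prior
        for x, (mu, sigma) in zip(row, stats):
            score *= gaussian_pdf(x, mu, sigma)
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical toy data: two features, two well-separated classes
train_rows = [(1.0, 2.1), (1.2, 1.9), (0.8, 2.0), (5.0, 7.9), (5.2, 8.1), (4.8, 8.0)]
train_labels = ["a", "a", "a", "b", "b", "b"]

model = fit(train_rows, train_labels)
print(predict(model, (1.1, 2.0)))
print(predict(model, (5.1, 8.2)))
```

With many features, the product of densities can underflow to zero, so implementations often sum log-likelihoods instead of multiplying likelihoods.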

This code provides a basic implementation of a Naive Bayes classifier for educational purposes. For more advanced applications, consider using specialized libraries like scikit-learn, which offer a wide range of machine learning algorithms and additional features for model evaluation and tuning.
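
For comparison, a minimal scikit-learn version might look like the sketch below. It assumes the bundled `load_iris` data instead of the CSV file and a fixed `random_state` for reproducibility (both are choices made for this example), and it fits on the training split only, so the test score is a held-out estimate.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Bundled Iris data; an assumption of this sketch (the script reads a CSV).
X, y = load_iris(return_X_y=True)

# 80/20 split, stratified so each class keeps its share in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

clf = GaussianNB()
clf.fit(X_train, y_train)  # parameters come from the training rows only

train_acc = clf.score(X_train, y_train)
test_acc = clf.score(X_test, y_test)
print("Train accuracy:", train_acc)
print("Test accuracy:", test_acc)
```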


In [2]:
# Define a function for the Naive Bayes classifier
def devdas(df, y, inp):
    # Extract the target variable values
    a = df[y].to_list()
    
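    # Column names of the DataFrame passed in, so the function does not rely on the global `fac`
    fac = list(df.columns)
    
    # Check if the input specifies a range or a list of factors
    dec = ':'
    factor = []
    if dec in inp:
        # If the input contains a range (e.g., "0:3"), extract columns from start to end, both inclusive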
        x = inp.split(':')
        for i in range(int(x[0]), int(x[1]) + 1):
            factor.append(fac[i])
    else:
        # If the input contains a comma-separated list (e.g., "0,1,3,5"), extract specified columns
        x = inp.split(',')
        for i in x:
            factor.append(fac[int(i)])
    
    # Include the target variable in the list of factors
    factor.append(y)
    
    # Select the relevant columns from the DataFrame
    df = df[factor]
    
    # Sort the columns to ensure consistent order
    df = df.sort_index(axis=1)
    
    # Per-class mean and sample standard deviation (ddof=1) of each feature column
    cols = [c for c in df.columns if c != y]
    mean = df.pivot_table(index=y, values=cols, aggfunc='mean')
    std = df.pivot_table(index=y, values=cols, aggfunc='std')
    std = std.to_numpy()
    mean = mean.to_numpy()
    
    # Gaussian probability density of x for mean m1 and standard deviation s1
    def normdist(x, m1, s1):
        k = 1 / (m.sqrt(2 * m.pi) * s1)
        prob = k * m.exp(-0.5 * ((x - m1) / s1) ** 2)
        return prob
    
    # Calculate the frequency of each class in the target variable
    f = dict()
    for i in range(len(a)):
        if a[i] in f:
            f[a[i]] += 1
        else:
            f[a[i]] = 1
    
    keys = list(f.keys())
    val = list(f.values())
    
    # Convert class counts to prior probabilities
    val = np.array(val)
    val = val / len(df)
    f = dict(zip(keys, val))
    
    # Sort the classes so their order matches the rows of `mean` and `std`
    f = dict(sorted(f.items()))
    keys = list(f.keys())
    
    # Remove the target variable column (reassign instead of dropping in place on a slice)
    df = df.drop(y, axis=1)
    cols = list(df.columns)
    
    # For each class, score every row as prior * product of feature likelihoods
    for w in keys:
        pro = []
        for i in range(len(df)):
            l = df.iloc[i].to_list()
            temp = 1
            for j in range(len(cols)):
                temp *= normdist(l[j], mean[keys.index(w), j], std[keys.index(w), j])
            pro.append(temp)
        pro = np.array(pro)
        pro = list(pro * f[w])
        df[w] = pro
    
    # Determine the predicted class for each data point
    m1 = []
    for i in range(len(df)):
        l = df.iloc[i].to_list()[len(cols):len(df.columns)]
        test = max(l)
        index = l.index(test)
        m1.append(keys[index])
    
    # Calculate and return accuracy
    count = 0
    for i in range(len(df)):
        if m1[i] == a[i]:
            count += 1
    return count / len(df)

# Input filename from the user
x = input("Enter your file name: ")

# Read data from the CSV file into a DataFrame
df = pd.read_csv(x)

# Get the column names
fac = list(df.columns)
print('Columns in the dataframe are:', fac)

# Input the target variable
y = input("Enter your target name: ")

# Select factors (features) for analysis
print('Select the factors (features):')
print('Enter the number in front of column names:')
for i in range(len(fac)):
    print(i, ' - ', fac[i])

inp = input('Enter the numbers: ')

# Split the data into training and testing sets
train = df.sample(frac=0.80)
test = df.drop(train.index)

# Apply the Naive Bayes classifier to training and testing datasets
trainp = devdas(train, y, inp)
testp = devdas(test, y, inp)

# Display accuracy for training and testing datasets
print("Naive Bayes accuracy for the training dataset:", trainp)
print("Naive Bayes accuracy for the testing dataset:", testp)


Enter your file name: iris.csv
Columns in the dataframe are: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']
Enter your target name: species
Select the factors (features):
Enter the number in front of column names:
0  -  sepal_length
1  -  sepal_width
2  -  petal_length
3  -  petal_width
4  -  species
Enter the numbers: 0:3
Naive Bayes accuracy for the training dataset: 0.95
Naive Bayes accuracy for the testing dataset: 1.0


In [3]:
# Split the train and test sets into features (x) and labels (y); assumes the target column is 'species'
y_train=train.species.to_numpy()
x_train=train.drop('species',axis=1).to_numpy()
y_test=test.species.to_numpy()
x_test=test.drop('species',axis=1).to_numpy()

In [4]:
# Using sklearn library to verify our values

nb = GaussianNB()

In [5]:
# The accuracy from sklearn matches the accuracy of our model
nb.fit(x_train,y_train)
print('Naive Bayes accuracy for train dataset using sklearn: ',nb.score(x_train,y_train))

Naive Bayes accuracy for train dataset using sklearn:  0.95


In [6]:
# Refit on the test split to mirror devdas(test, ...), which also fits and scores on the same rows
nb.fit(x_test,y_test)
print('Naive Bayes accuracy for test dataset using sklearn: ',nb.score(x_test,y_test))

Naive Bayes accuracy for test dataset using sklearn:  1.0
