# Naive Bayes Classifier
The `NaiveBayesClassifier` class implements the Naive Bayes algorithm for categorical features and a categorical target variable. It provides methods to fit the classifier to a training dataset, calculate class and conditional probabilities, and make predictions on a test dataset.
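
The prediction rule can be summarized as follows. The symbols are notation introduced here for illustration and do not appear in the code: $c$ is a candidate class and $x_1, \dots, x_n$ are the feature values of one row.

$$
\hat{c} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(x_i \mid c)
$$

Under the assumption that the features are conditionally independent given the class, this product is proportional to the posterior $P(c \mid x_1, \dots, x_n)$, so the class with the largest product is also the most probable one.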

### Libraries



- `pandas`: used to read and process the datasets stored in CSV files. It provides efficient data structures and functions for handling tabular data.

- `collections.defaultdict`: used to initialize the `class_counts` and `conditional_counts` dictionaries with default integer values, so that counts can be accumulated without explicit initialization.

- `train_test_split` (from `sklearn.model_selection`): used to divide a dataset into separate training and testing sets, so that the model is trained on one portion of the data and evaluated on unseen data.
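
As a minimal sketch of the counting pattern that `defaultdict` enables, the snippet below accumulates counts without initializing any key first. The labels and the `'Sunny|No'` key are illustrative values, chosen to resemble the key format used by the class in the next section.

```python
from collections import defaultdict

# Flat counter: a missing key starts at 0, so += 1 works immediately.
class_counts = defaultdict(int)
for label in ['Yes', 'No', 'Yes']:
    class_counts[label] += 1

# Nested counter: one inner defaultdict(int) is created per feature name.
conditional_counts = defaultdict(lambda: defaultdict(int))
conditional_counts['Outlook']['Sunny|No'] += 1

print(dict(class_counts))                   # {'Yes': 2, 'No': 1}
print(dict(conditional_counts['Outlook']))  # {'Sunny|No': 1}
```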

In [1]:
import pandas as pd
from collections import defaultdict
from sklearn.model_selection import train_test_split

### Naive Bayes Classifier Class

In [2]:
class NaiveBayesClassifier:
    def __init__(self, laplace_smoothing=True):
        """
        Naive Bayes Classifier initialization.

        Parameters:
            laplace_smoothing (bool): Flag indicating whether to apply Laplace smoothing (default: True).
        """
        self.class_probabilities = {}
        self.conditional_probabilities = {}
        self.class_counts = defaultdict(int)
        self.conditional_counts = defaultdict(lambda: defaultdict(int))
        self.target_given_attribute = {}
        self.laplace_smoothing = laplace_smoothing
    
    def fit(self, X, y):
        """
        Fit the Naive Bayes Classifier to the training data.

        Parameters:
            X (pandas.DataFrame): Training features.
            y (pandas.Series): Target variable.

        Returns:
            None
        """
        self.calculate_class_probabilities(y)
        self.calculate_conditional_probabilities(X, y)
    
    def calculate_class_probabilities(self, y):
        """
        Calculate class probabilities.

        Parameters:
            y (pandas.Series): Target variable.

        Returns:
            None
        """
        total_examples = len(y)

        for value in y:
            self.class_counts[value] += 1

        for class_key, class_count in self.class_counts.items():
            self.class_probabilities[class_key] = class_count / total_examples
    
    def calculate_conditional_probabilities(self, X, y):
        """
        Calculate conditional probabilities, optionally with Laplace correction.

        Parameters:
            X (pandas.DataFrame): Training features.
            y (pandas.Series): Target variable.

        Returns:
            None
        """
        for index, row in X.iterrows():
            target_value = y[index]

            for column_name, column_value in row.items():
                conditional_key = str(column_value) + '|' + str(target_value)
                self.conditional_counts[column_name][conditional_key] += 1

                if self.laplace_smoothing:
                    unseen_key = '*unseen*' + '|' + str(target_value)
                    self.conditional_counts[column_name][unseen_key] += 1

        for column_name, conditional_count in self.conditional_counts.items():
            self.conditional_probabilities[column_name] = {}
            for conditional_key, count in conditional_count.items():
                class_key = conditional_key.split('|')[1]
                if self.laplace_smoothing:
                    self.conditional_probabilities[column_name][conditional_key] = (count + 1) / (self.class_counts[class_key] + 1)
                else:
                    self.conditional_probabilities[column_name][conditional_key] = count / self.class_counts[class_key]
    
    def predict(self, X):
        """
        Predict the target variable for new data.

        Parameters:
            X (pandas.DataFrame): New data features.

        Returns:
            list: Predicted target variable values.
        """
        predictions = []

        for _, row in X.iterrows():
            target_given_attribute = self.class_probabilities.copy()

            for column_name, value in row.items():
                for class_key in target_given_attribute:
                    conditional_key = str(value) + '|' + class_key

                    if conditional_key in self.conditional_probabilities[column_name]:
                        target_given_attribute[class_key] *= self.conditional_probabilities[column_name][conditional_key]

            predicted_class = max(target_given_attribute, key=target_given_attribute.get)
            predictions.append(predicted_class)
        
        self.target_given_attribute = target_given_attribute

        return predictions
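
For comparison, the textbook add-one (Laplace) estimate of a conditional probability is $(n_{v,c} + 1) / (n_c + k)$, where $n_{v,c}$ counts the rows with value $v$ and class $c$, $n_c$ counts the rows of class $c$, and $k$ is the number of distinct values of the feature. The sketch below illustrates that formula only; it is not the computation performed by the class above. `laplace_estimate`, its parameters, and the counts are hypothetical and chosen for illustration.

```python
def laplace_estimate(value_count, class_count, n_distinct_values, alpha=1):
    """Add-alpha estimate of P(value | class) for one categorical feature."""
    return (value_count + alpha) / (class_count + alpha * n_distinct_values)

# Illustrative counts: a feature with 3 distinct values observed
# 3, 2 and 0 times among 5 rows of the same class.
counts = {'Sunny': 3, 'Rainy': 2, 'Overcast': 0}
estimates = {value: laplace_estimate(n, 5, len(counts)) for value, n in counts.items()}

print(estimates)                # {'Sunny': 0.5, 'Rainy': 0.375, 'Overcast': 0.125}
print(sum(estimates.values()))  # 1.0
```

With this estimate a value that never occurs with a class still gets a small non-zero probability, and the estimates for one feature and one class still sum to 1.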


### Helper Function

The *calculate_accuracy* function is used to measure the accuracy of predicted labels compared to the actual labels. It takes two lists of labels as input and returns the accuracy as a floating-point value between 0 and 1. The function counts the number of correct predictions and divides it by the total number of labels to calculate the accuracy.

In [3]:
def calculate_accuracy(actual_labels, predicted_labels):
    """
    Calculate the accuracy of predicted labels compared to actual labels.

    Args:
        actual_labels (list): The list of actual labels.
        predicted_labels (list): The list of predicted labels.

    Returns:
        float: The accuracy as a value between 0 and 1.
    """
    correct_count = 0
    total_count = len(actual_labels)

    for actual, predicted in zip(actual_labels, predicted_labels):
        if actual == predicted:
            correct_count += 1

    accuracy = correct_count / total_count
    return accuracy


### Execution

#### Test Dataset 

The dataset provided by the slides is a weather dataset that contains information about various weather conditions and whether or not outdoor activities can be played. It includes the following features:

- *Outlook*: Describes the outlook of the weather (Sunny, Overcast, Rainy).
- *Temp*: Represents the temperature (Hot, Mild, Cool).
- *Humidity*: Indicates the humidity level (High, Normal).
- *Windy*: Indicates whether it is windy or not (True, False).
- *Play*: Indicates whether outdoor activities can be played or not (Yes, No).

The dataset consists of 14 instances or samples, each representing a different weather scenario. It is a small dataset used to demonstrate the implementation of the Naive Bayes classifier. The goal is to predict the "Play" class based on the given weather conditions.

In [4]:
df = pd.read_csv('./data/dataset.csv')
df

Unnamed: 0,Outlook,Temp,Humidity,Windy,Play
0,Sunny,Hot,High,False,No
1,Sunny,Hot,High,True,No
2,Overcast,Hot,High,False,Yes
3,Rainy,Mild,High,False,Yes
4,Rainy,Cool,Normal,False,Yes
5,Rainy,Cool,Normal,True,No
6,Overcast,Cool,Normal,True,Yes
7,Sunny,Mild,High,False,No
8,Sunny,Cool,Normal,False,Yes
9,Rainy,Mild,Normal,False,Yes


`X_train` contains all the columns except the 'Play' column, while `y_train` contains only the 'Play' column. This split lets us train the model to predict the target variable from the provided features.

In [5]:
X_train = df.drop('Play', axis=1)
y_train = df['Play']

We create a test dataset with a single instance. This example is taken from the book and demonstrates how the trained Naive Bayes classifier can predict the outcome for new, unseen data by calling the predict method with `X_test` as input.

In [6]:
test_data = {
    'Outlook': ['Sunny'],
    'Temp': ['Cool'],
    'Humidity': ['High'],
    'Windy': [True],
    'Play': ['No']
}

test_df = pd.DataFrame(test_data)

X_test = test_df.drop('Play', axis=1)



We use the classifier by calling *fit* to train the model on the training data and then *predict* to make predictions on new data. The model assumes that the features are conditionally independent given the target variable.

In [7]:
classifier = NaiveBayesClassifier(laplace_smoothing=False)
classifier.fit(X_train, y_train)

predictions = classifier.predict(X_test)

In [8]:
print('Target given all the attributes result: ', classifier.target_given_attribute)
print('Predictions:', predictions)

Target given all the attributes result:  {'No': 0.02057142857142857, 'Yes': 0.005291005291005291}
Predictions: ['No']
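
The two scores printed above can be reproduced by hand. The sketch below multiplies the class prior by the four conditional probabilities for Sunny, Cool, High and Windy=True. The counts are an assumption: they are those of the standard 14-instance weather data (9 'Yes', 5 'No'), which is consistent with the printed values.

```python
from fractions import Fraction as F

# Assumed counts from the standard 14-instance weather data.
prior = {'No': F(5, 14), 'Yes': F(9, 14)}
likelihoods = {
    # P(Sunny|c), P(Cool|c), P(High|c), P(Windy=True|c)
    'No':  [F(3, 5), F(1, 5), F(4, 5), F(3, 5)],
    'Yes': [F(2, 9), F(3, 9), F(3, 9), F(3, 9)],
}

scores = {}
for label, p in prior.items():
    score = p
    for likelihood in likelihoods[label]:
        score *= likelihood
    scores[label] = score

# Normalizing the scores turns them into posterior probabilities.
total = sum(scores.values())
for label, score in scores.items():
    print(label, float(score), float(score / total))
```

The unnormalized scores come out at roughly 0.0206 for 'No' and 0.0053 for 'Yes', in line with the output above. After normalization, 'No' has a posterior of about 0.795.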


This code calculates the accuracy by comparing the actual labels from the 'Play' column of the test_df DataFrame with the predicted labels. The accuracy value is then printed.

In [9]:
actual_labels = test_df['Play'].tolist()
predicted_labels = predictions

accuracy = calculate_accuracy(actual_labels, predicted_labels)
print('Accuracy:', accuracy)


Accuracy: 1.0


#### Dataset 1

The dataset includes three columns:

- *glucose*: Represents the measured glucose levels of individuals.
- *bloodpressure*: Contains recorded blood pressure measurements of individuals.
- *diabetes*: Indicates whether an individual has diabetes (0 for no, 1 for yes).

The dataset contains 995 records.

In [10]:
DiabetesPredictionDF = pd.read_csv('./data/diabetes.csv')
DiabetesPredictionDF


Unnamed: 0,glucose,bloodpressure,diabetes
0,40,85,0
1,40,92,0
2,45,63,1
3,45,80,0
4,40,73,1
...,...,...,...
990,45,87,0
991,40,83,0
992,40,83,0
993,40,60,1


##### Data Preprocessing

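1. Target Mapping - The integer labels 0 and 1 are mapped to "no" and "yes". This gives a more descriptive representation of the diabetes status, and it is also required here: the classifier above builds its lookup keys by string concatenation, so the integer labels 0 and 1 raise an error.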

In [11]:
diabetes_mapping = {0: "no", 1: "yes"}
DiabetesPredictionDF["diabetes"] = DiabetesPredictionDF["diabetes"].map(diabetes_mapping)

2. Binning - Binning was used to convert continuous variables into categories, making the data easier to understand and analyze.

In [12]:
glucose_bins = [0, 80, 100, 125, float('inf')]
glucose_labels = ['low', 'normal', 'pre-diabetic', 'diabetic']

DiabetesPredictionDF['glucose'] = pd.cut(DiabetesPredictionDF['glucose'], bins=glucose_bins, labels=glucose_labels)

bloodpressure_bins = [0, 80, 90, 120, float('inf')]
bloodpressure_labels = ['normal', 'elevated', 'high', 'very high']

DiabetesPredictionDF['bloodpressure'] = pd.cut(DiabetesPredictionDF['bloodpressure'], bins=bloodpressure_bins, labels=bloodpressure_labels)
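
To make the bin boundaries concrete, the sketch below applies the same glucose bins to a few hand-picked values (the sample values are illustrative, not taken from the dataset). By default `pd.cut` uses right-closed intervals, so a value equal to an edge falls into the bin on its left.

```python
import pandas as pd

glucose_bins = [0, 80, 100, 125, float('inf')]
glucose_labels = ['low', 'normal', 'pre-diabetic', 'diabetic']

# Illustrative values placed on and around the bin edges.
sample = pd.Series([40, 80, 81, 100, 126])
binned = pd.cut(sample, bins=glucose_bins, labels=glucose_labels)

print(binned.tolist())  # ['low', 'low', 'normal', 'normal', 'diabetic']
```

Note that with these defaults a value of exactly 0 lies outside the first interval and becomes `NaN` unless `include_lowest=True` is passed.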



##### Model Training

In [13]:
X_2 = DiabetesPredictionDF.drop("diabetes", axis=1)  
y_2 = DiabetesPredictionDF["diabetes"]  

X_2_train, X_2_test, y_2_train, y_2_test = train_test_split(X_2, y_2, test_size=0.2, random_state=42)


In [14]:
classifier_2 = NaiveBayesClassifier()
classifier_2.fit(X_2_train, y_2_train)

##### Model Testing

In [15]:
predictions = classifier_2.predict(X_2_test)

In [16]:
actual_labels = y_2_test.tolist()
predicted_labels = predictions

accuracy = calculate_accuracy(actual_labels, predicted_labels)
print('Accuracy:', accuracy)

Accuracy: 0.8341708542713567


#### Dataset 2

The dataset contains information about job-related attributes and salary details. The attributes are:

- *work_year*: Represents the specific year(s) in which individuals have been employed, such as 2022, 2023, and so on.
- *experience_level*: Indicates the level of experience of individuals, such as entry-level, mid-level, or senior-level.
- *employment_type*: Specifies the type of employment, such as full-time, part-time, or contract.
- *job_title*: Refers to the job title or position held by individuals.
- *salary*: Represents the salary amount earned by individuals.
- *salary_currency*: Indicates the currency in which the salary is recorded.
- *salary_in_usd*: Represents the salary amount converted to US dollars (USD) for standardized comparison.
- *employee_residence*: Specifies the location of the employee's residence.
- *remote_ratio*: Indicates the ratio or percentage of remote work offered in the job.
- *company_location*: Refers to the location of the company or organization.
- *company_size*: Indicates the size of the company, such as small, medium, or large.

The dataset has 11 columns and 3755 records. 

In [17]:
DataScienceSalariesDF = pd.read_csv('./data/ds_salaries.csv')
DataScienceSalariesDF

Unnamed: 0,work_year,experience_level,employment_type,job_title,salary,salary_currency,salary_in_usd,employee_residence,remote_ratio,company_location,company_size
0,2023,SE,FT,Principal Data Scientist,80000,EUR,85847,ES,100,ES,L
1,2023,MI,CT,ML Engineer,30000,USD,30000,US,100,US,S
2,2023,MI,CT,ML Engineer,25500,USD,25500,US,100,US,S
3,2023,SE,FT,Data Scientist,175000,USD,175000,CA,100,CA,M
4,2023,SE,FT,Data Scientist,120000,USD,120000,CA,100,CA,M
...,...,...,...,...,...,...,...,...,...,...,...
3750,2020,SE,FT,Data Scientist,412000,USD,412000,US,100,US,L
3751,2021,MI,FT,Principal Data Scientist,151000,USD,151000,US,100,US,L
3752,2020,EN,FT,Data Scientist,105000,USD,105000,US,100,US,S
3753,2020,EN,CT,Business Data Analyst,100000,USD,100000,US,100,US,L


##### Data Preprocessing

1. We drop the *salary* and *salary_currency* columns because their information is captured in comparable form by the *salary_in_usd* column.

In [18]:
columns_to_drop = ['salary', 'salary_currency']
DataScienceSalariesDF = DataScienceSalariesDF.drop(columns=columns_to_drop)

2. We apply binning to the target column, *salary_in_usd*, because its values are continuous.

In [19]:
salary_bins = [0, 50000, 75000, 100000, 125000, 150000, float('inf')]
salary_labels = ['<50k', '50k-75k', '75k-100k', '100k-125k', '125k-150k', '>150k']

DataScienceSalariesDF['salary_in_usd'] = pd.cut(DataScienceSalariesDF['salary_in_usd'], bins=salary_bins, labels=salary_labels)


##### Model Training

In [20]:
X_3 = DataScienceSalariesDF.drop('salary_in_usd', axis=1)  
y_3 = DataScienceSalariesDF['salary_in_usd']  

X_3_train, X_3_test, y_3_train, y_3_test = train_test_split(X_3, y_3, test_size=0.2, random_state=42)

In [21]:
classifier_3 = NaiveBayesClassifier()
classifier_3.fit(X_3_train, y_3_train)

##### Model Testing

In [22]:
predictions = classifier_3.predict(X_3_test)

In [23]:
actual_labels = y_3_test.tolist()
predicted_labels = predictions

accuracy = calculate_accuracy(actual_labels, predicted_labels)
print('Accuracy:', accuracy)

Accuracy: 0.3262316910785619
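
An accuracy figure for a six-class target is easier to interpret next to a simple reference point. One common choice, which is not part of the original notebook, is the majority-class baseline: always predict the most frequent training label. The sketch below assumes a hypothetical helper `majority_baseline_accuracy` and uses toy labels rather than the real data.

```python
from collections import Counter

def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy obtained by always predicting the most frequent training label."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in test_labels if label == majority_label)
    return hits / len(test_labels)

# Toy labels for illustration only.
train = ['>150k', '>150k', '<50k', '100k-125k', '>150k']
test = ['>150k', '<50k', '>150k', '125k-150k']

print(majority_baseline_accuracy(train, test))  # 0.5
```

With the notebook's variables, the call would be `majority_baseline_accuracy(y_3_train.tolist(), y_3_test.tolist())`. A classifier that scores below this value is doing worse than the trivial predictor.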
