# Naive Bayes - Solutions

<hr style="clear:both">

**Author:** Sabri El Amrani

<hr style="clear:both">

In [None]:
# Style rule to align all tables to the left (useful later on)

In [None]:
%%html
<style>
table {float:left}
</style>

### Imports

In [None]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

## 0. Intro

*__Background:__
As a data analyst focused on optimizing smart home technologies, you're tasked with understanding the factors that contribute to the efficiency and performance of smart home devices. You are working with a dataset that captures various metrics related to device usage, energy consumption, user behavior, and reliability. Your objective is to build a predictive model that classifies devices as either efficient or inefficient based on these features. Accurate classification will help improve smart home designs, enhance energy efficiency, and guide better user experience strategies.*

<img src="images/smart_home_device.png" style="width:500px"/>

[Source](https://www.iotevolutionworld.com/smart-home/articles/438532-how-secure-smart-home-devices-5-steps.htm)

## 1. Data loading & pre-processing

Let's start by preparing our data, using the following [dataset](https://www.kaggle.com/datasets/rabieelkharoua/predict-smart-home-device-efficiency-dataset) taken from Kaggle.

In [None]:
# In Pandas, a data table is called a DataFrame (abbreviated to df)
data = pd.read_csv('data/smart_home_device_usage_data.csv')

print(f"There are {data.shape[0]} rows and {data.shape[1]} columns.")
# Show the first 5 rows of the data
data.head(5)

We can drop `UserID`, since an arbitrary identifier carries no predictive information (you are not expected to know how to perform DataFrame operations like this one).

In [None]:
data_filtered = data.drop('UserID', axis=1)

Let's split the dataset into training and test sets using sklearn's built-in `train_test_split` function.

In [None]:
# Use the filtered DataFrame so that UserID is not kept as a feature
X = data_filtered.drop('SmartHomeEfficiency', axis=1)
y = data_filtered['SmartHomeEfficiency']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

Run the following cell to better understand what the preprocessed data looks like.

In [None]:
# Show shapes
print('Training set shape:')
print(f'X: {X_train.shape}, y: {y_train.shape}')

print('\nTest set shape:')
print(f'X: {X_test.shape}, y: {y_test.shape}')

## 2. Naive Bayes

We are now ready to dive into our classification task of the day.
As a reminder, the Naive Bayes classifier is built on Bayes' theorem (illustrated below), combined with the "naive" assumption that the features are conditionally independent given the class. Each data point is assigned to the class with the highest posterior probability.

<img src="images/formula.png" style="width:500px"/>

[Source](https://uc-r.github.io/naive_bayes)

### 2.1. Calculate Priors
- Compute the prior probability for each class, which is the proportion of each class in the training dataset.
- **Formula:**

 $$
   P(C_k) = \frac{N_k}{N} 
 $$

  where:
  - $ P(C_k) $ is the prior probability of class $ C_k $
  - $ N_k $ is the number of instances of class $ C_k $
  - $ N $ is the total number of instances
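
As a quick sanity check of the formula, here is a minimal sketch on a hypothetical toy label vector (`y_toy` is made up for illustration and is not part of the dataset):

```python
import numpy as np

# Hypothetical toy labels (not taken from the dataset): 6 instances of class 0, 4 of class 1
y_toy = np.array([0, 0, 1, 0, 1, 1, 0, 0, 1, 0])

# np.bincount counts how often each non-negative integer label occurs: N_k
toy_counts = np.bincount(y_toy)

# Dividing by the total number of instances N gives the priors P(C_k)
toy_priors = toy_counts / len(y_toy)

print(toy_counts)   # [6 4]
print(toy_priors)   # [0.6 0.4]
```

The priors always sum to 1, which is an easy property to verify on the real training labels as well.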

In [None]:
### START CODE HERE ###
# Expected output: array containing the prior for each class
# Hint: Use np.bincount()
class_counts = np.bincount(y_train)
class_priors = class_counts / len(y_train)
### END CODE HERE ###

### 2.2. Calculate Likelihoods

In [None]:
discrete_features = ['DeviceType', 'UserPreferences']
continuous_features = ['UsageHoursPerDay', 'EnergyConsumption', 'MalfunctionIncidents', 'DeviceAgeMonths']

- **Discrete Features:**
  - Calculate the likelihood for each value of the discrete features given each class.
  - **Formula:**

    $$
    P(X_i = x_i | C_k) = \frac{\text{Count}(X_i = x_i \land C_k)}{\text{Count}(C_k)}
    $$

    where:
    - $ P(X_i = x_i | C_k) $ is the likelihood of feature $ X_i $ taking value $ x_i $ given class $ C_k $
    - $\text{Count}(X_i = x_i \land C_k)$ is the count of instances where $ X_i = x_i $ and class is $ C_k $
    - $\text{Count}(C_k)$ is the count of instances of class $ C_k $
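
The counting formula above can be checked by hand on a small example. The sketch below uses a hypothetical toy DataFrame (the rows and the `label` column name are made up for illustration) and computes the likelihoods for class 1 as relative frequencies:

```python
import pandas as pd

# Hypothetical toy data: one discrete feature and a binary class label
toy = pd.DataFrame({
    "DeviceType": ["Lights", "Camera", "Lights", "Lights", "Camera", "Camera"],
    "label":      [1,        1,        1,        0,        0,        0],
})

# Keep only the rows of class 1, then take the relative frequency of each value:
# Count(X_i = x_i and C_k) / Count(C_k)
in_class_1 = toy[toy["label"] == 1]
toy_likelihoods = in_class_1["DeviceType"].value_counts(normalize=True).to_dict()

# Class 1 contains "Lights" twice and "Camera" once, so the likelihoods are 2/3 and 1/3
print(toy_likelihoods)
```

Note that `value_counts` only lists values that actually occur within the class, whereas looping over every unique value of the feature also yields an explicit 0 for values the class never takes.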

In [None]:
def calculate_discrete_likelihoods(X_train, y_train, feature, class_value):
    """
    Computes the likelihoods P(feature_value | class_value) for each unique value 
    of a discrete feature given a class in a Naive Bayes classifier.

    Args:
        X_train (pd.DataFrame): Training data with features.
        y_train (pd.Series): Target labels.
        feature (str): Feature name to calculate likelihoods for.
        class_value (any): Class value to condition on.

    Returns:
        dict: Likelihoods of each feature value given the class.
    """
    likelihoods = {}
    values = X_train[feature].unique()
    for value in values:
        ### START CODE HERE ###
        likelihoods[value] = ((X_train[feature] == value) & (y_train == class_value)).sum() / (y_train == class_value).sum()
        ### END CODE HERE ###
    return likelihoods


discrete_likelihoods = {}
for feature in discrete_features:
    discrete_likelihoods[feature] = {}
    for class_value in [0, 1]:
        discrete_likelihoods[feature][class_value] = calculate_discrete_likelihoods(X_train, y_train, feature, class_value)

- **Continuous Features:**
  - Assume a Gaussian distribution for each continuous feature and estimate its mean and variance within each class.
  - **Formula:**

    $$
    L(X_i = x_i | C_k) = \frac{1}{\sqrt{2\pi\sigma_k^2}} \exp\left( -\frac{(x_i - \mu_k)^2}{2\sigma_k^2} \right)
    $$

    where:
    - $ \mu_k $ and $ \sigma_k^2 $ are the mean and variance of feature $ X_i $ given class $ C_k $
    - $ L(X_i = x_i | C_k) $ is the likelihood of feature $ X_i $ taking value $ x_i $ given class $ C_k $

<div class="alert alert-info">
<strong>Important note</strong>

For a continuous $X_i$, the probability of $X_i$ being exactly equal to any single value is always 0. We highlight the fact that the likelihood is NOT a probability using the notation $ L(X_i = x_i | C_k) $, which evaluates the _probability density function_ at $x_i$ given class $C_k$.
    
</div>
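
To make the note above concrete, here is a small sketch showing that a density value can exceed 1, which a probability never can. The helper name `gaussian_density` is an assumption chosen for this illustration:

```python
import numpy as np

def gaussian_density(x, mean, variance):
    """Evaluate the normal probability density function at x."""
    return np.exp(-(x - mean) ** 2 / (2 * variance)) / np.sqrt(2 * np.pi * variance)

# Standard normal evaluated at its mean: 1 / sqrt(2 * pi)
standard_peak = gaussian_density(0.0, mean=0.0, variance=1.0)

# A narrow Gaussian (variance 0.01) evaluated at its mean: 10 / sqrt(2 * pi)
narrow_peak = gaussian_density(0.0, mean=0.0, variance=0.01)

print(f"{standard_peak:.4f}")  # 0.3989
print(f"{narrow_peak:.4f}")    # 3.9894
```

The second value is larger than 1 because a density only has to integrate to 1 over the whole real line; it is not bounded by 1 pointwise.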

In [None]:
def calculate_mean_var(X_train, y_train, feature):
    """
    Calculates the mean and variance of a feature for each class in a Naive Bayes classifier.

    Args:
        X_train (pd.DataFrame): Training data with features.
        y_train (pd.Series): Target labels.
        feature (str): Feature name to calculate mean and variance for.

    Returns:
        dict: Mean and variance of the feature for each class (0 and 1).
    """
    likelihoods = {}
    for class_value in [0, 1]:
        data = X_train[feature][y_train == class_value]
        ### START CODE HERE ###
        mean = data.mean()
        var = data.var()  # pandas default: sample variance (ddof=1)
        ### END CODE HERE ###
        likelihoods[class_value] = {'mean': mean, 'variance': var}
    return likelihoods

continuous_mean_var = {}
for feature in continuous_features:
    continuous_mean_var[feature] = calculate_mean_var(X_train, y_train, feature)

In [None]:
def calculate_continuous_likelihoods(x, mean, variance):
    """
    Computes the likelihood of a continuous feature using the Gaussian (normal) distribution.

    Args:
        x (float): The feature value.
        mean (float): The mean of the feature for the class.
        variance (float): The variance of the feature for the class.

    Returns:
        float: The likelihood of the feature value given the class.
    """
    ### START CODE HERE ###
    exponent = np.exp(-(x - mean) ** 2 / (2 * variance))
    return (1 / np.sqrt(2 * np.pi * variance)) * exponent
    ### END CODE HERE ###

### 2.3. Calculate Posterior Probabilities
- For a given instance, calculate the posterior probability for each class using Bayes' theorem.
- We have $n_d$ discrete features and $n_c$ continuous features.
- **Formula:**

$$
  P(C_k | X) \propto P(C_k) \prod_{i=1}^{n_d} P(X_i | C_k) \prod_{j=1}^{n_c} L(X_j | C_k)
 $$

 or equivalently, after taking logarithms (what we will implement here):

$$
 \log P(C_k | X) = \log P(C_k) + \sum_{i=1}^{n_d} \log P(X_i | C_k) + \sum_{j=1}^{n_c} \log L(X_j | C_k) + \text{const}
 $$

  where:
  - $ P(C_k | X) $ is the posterior probability of class $ C_k $ given the instance $ X $
  - $ P(C_k) $ is the prior probability of class $ C_k $
  - $ P(X_i | C_k) $ is the likelihood of discrete feature $ X_i $ given class $ C_k $
  - $ L(X_j | C_k) $ is the likelihood of continuous feature $ X_j $ given class $ C_k $
  - $ \text{const} $ is a term that does not depend on the class $ C_k $, so it can be ignored when comparing classes
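
One reason to work with sums of logarithms rather than products is floating-point underflow. With only a handful of features this is unlikely to matter in this notebook, but the effect is easy to demonstrate with a hypothetical, deliberately extreme set of factors:

```python
import numpy as np

# Hypothetical likelihood values: 1000 factors, each equal to 1e-5
factors = np.full(1000, 1e-5)

# Multiplying them directly underflows to exactly 0.0 in float64
direct_product = np.prod(factors)

# Summing the logs keeps the same information in a representable range
log_sum = np.sum(np.log(factors))

print(direct_product)      # 0.0
print(round(log_sum, 1))   # -11512.9
```

Because the logarithm is strictly increasing, the class that maximizes the sum of logs is the same class that would maximize the product.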

In [None]:
def compute_log_posteriors(row, class_priors, discrete_features, continuous_features, 
                           discrete_likelihoods, continuous_mean_var, calculate_continuous_likelihoods):
    """
    Computes the log posteriors for each class given a data row using both discrete and continuous features.

    Args:
        row (pd.Series): The data row to classify.
        class_priors (np.ndarray): Prior probabilities for each class (the log is taken inside this function).
        discrete_features (list): List of discrete feature names.
        continuous_features (list): List of continuous feature names.
        discrete_likelihoods (dict): Likelihoods of discrete features given each class.
        continuous_mean_var (dict): Mean and variance of continuous features for each class.
        calculate_continuous_likelihoods (function): Function to compute likelihoods for continuous features.

    Returns:
        list: Unnormalized log posterior for each class.
    """
    posteriors = []
    for class_value in [0, 1]:
        ### START CODE HERE ###
        # Add the log of the class priors to the class posteriors
        prior = np.log(class_priors[class_value])
        posterior = prior
        ### END CODE HERE ###

        # Discrete features
        for feature in discrete_features:
            value = row[feature]
            if value in discrete_likelihoods[feature][class_value]:
                ### START CODE HERE ###
                # Add the log of the discrete likelihoods to the class posteriors
                posterior += np.log(discrete_likelihoods[feature][class_value][value])
                ### END CODE HERE ###
            else:
                posterior += np.log(1e-6)  # Smoothing for unseen values

        # Continuous features
        for feature in continuous_features:
            value = row[feature]
            mean = continuous_mean_var[feature][class_value]['mean']
            variance = continuous_mean_var[feature][class_value]['variance']
            ### START CODE HERE ###
            # Add the log of the continuous likelihoods to the class posteriors
            posterior += np.log(calculate_continuous_likelihoods(value, mean, variance))
            ### END CODE HERE ###

        posteriors.append(posterior)
    return posteriors
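
The `1e-6` fallback above only covers feature values that never appear in the training set. A value that appears in training but never within a given class has a likelihood of exactly 0, and `np.log(0)` is `-inf`. One common way to avoid this is additive (Laplace) smoothing. The sketch below is an assumption-laden illustration, not part of the exercise: the function name `smoothed_discrete_likelihoods` and the parameter `alpha` are hypothetical.

```python
import pandas as pd

def smoothed_discrete_likelihoods(feature_values, labels, class_value, alpha=1.0):
    """Additive smoothing: (count + alpha) / (class_count + alpha * n_values)."""
    values = feature_values.unique()
    in_class = labels == class_value
    class_count = in_class.sum()
    return {
        value: (((feature_values == value) & in_class).sum() + alpha)
        / (class_count + alpha * len(values))
        for value in values
    }

# Hypothetical toy data: "Camera" never occurs together with class 1
toy_feature = pd.Series(["Lights", "Lights", "Camera", "Lights"])
toy_labels = pd.Series([1, 1, 0, 0])

smoothed = smoothed_discrete_likelihoods(toy_feature, toy_labels, class_value=1)

# "Lights": (2 + 1) / (2 + 2) = 0.75, "Camera": (0 + 1) / (2 + 2) = 0.25
print(smoothed)
```

With `alpha=1.0` every value receives a strictly positive likelihood, and the smoothed likelihoods still sum to 1 over the values seen in training.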

### 2.4. Make Predictions
- Assign the class label with the highest posterior probability to the instance.
- **Formula:**

  $$
  \hat{y} = \arg\max_k P(C_k | X)
  $$

  or equivalently (what we will be using here for numerical stability):
  $$
  \hat{y} = \arg\max_k ( \log P(C_k | X) )
  $$

  where:
  - $ \hat{y} $ is the predicted class label
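
As a small illustration of the decision rule, the sketch below takes hypothetical log posteriors for the two classes (the numbers are made up), picks the larger one, and optionally converts them back into normalized probabilities:

```python
import numpy as np

# Hypothetical unnormalized log posteriors for classes 0 and 1
log_posteriors = np.array([-12.3, -10.1])

# The predicted class is the index of the largest log posterior
predicted_class = int(np.argmax(log_posteriors))

# Optional: recover normalized probabilities.
# Subtracting the maximum before exponentiating avoids underflow.
shifted = np.exp(log_posteriors - log_posteriors.max())
probabilities = shifted / shifted.sum()

print(predicted_class)          # 1
print(probabilities.round(3))   # [0.1 0.9]
```

The normalization step is not needed for prediction, since it does not change which class has the highest value.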

In [None]:
def predict(X_test, class_priors, discrete_features, continuous_features, 
            discrete_likelihoods, continuous_mean_var, calculate_continuous_likelihoods):
    """
    Predicts class labels for a test dataset using a Naive Bayes classifier with both 
    discrete and continuous features.

    Args:
        X_test (pd.DataFrame): Test dataset to classify.
        class_priors (np.ndarray): Prior probabilities for each class.
        discrete_features (list): List of discrete feature names.
        continuous_features (list): List of continuous feature names.
        discrete_likelihoods (dict): Likelihoods of discrete features given each class.
        continuous_mean_var (dict): Mean and variance of continuous features for each class.
        calculate_continuous_likelihoods (function): Function to compute likelihoods for continuous features.

    Returns:
        np.ndarray: Predicted class labels for the test dataset.
    """
    predictions = []
    for _, row in X_test.iterrows():
        # Compute log posteriors for each class
        posteriors = compute_log_posteriors(row, class_priors, discrete_features, continuous_features,
                                            discrete_likelihoods, continuous_mean_var, 
                                            calculate_continuous_likelihoods)
        ### START CODE HERE ###
        # Predict the class with the highest posterior
        predictions.append(np.argmax(posteriors))
        ### END CODE HERE ###
    return np.array(predictions)

# Making predictions on the test set
y_pred = predict(X_test, class_priors, discrete_features, continuous_features, 
                 discrete_likelihoods, continuous_mean_var, calculate_continuous_likelihoods)

## 3. Model Evaluation

Let's now evaluate our model using sklearn's built-in `accuracy_score` function.

In [None]:
print(f'Accuracy: {accuracy_score(y_test, y_pred):.3f}')

Pretty good for a naive classifier, right?
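
To put an accuracy figure in context, it helps to compare it with a trivial baseline that always predicts the most frequent training class. The helper below is a hypothetical sketch (the name `majority_baseline_accuracy` and the toy labels are assumptions made for illustration):

```python
import numpy as np

def majority_baseline_accuracy(y_train, y_test):
    """Accuracy of always predicting the most frequent training class."""
    majority_class = np.bincount(y_train).argmax()
    return float(np.mean(np.asarray(y_test) == majority_class))

# Hypothetical toy labels, only to show the call
toy_train = np.array([0, 0, 0, 1, 1])
toy_test = np.array([0, 1, 0, 0])

baseline = majority_baseline_accuracy(toy_train, toy_test)
print(baseline)  # 0.75
```

Calling the same function on the notebook's `y_train` and `y_test` gives the accuracy that the Naive Bayes classifier should beat to be considered useful.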

Thank you for your participation! In the next notebooks, you will learn to implement K-Nearest Neighbors (KNN), a ubiquitous classification technique.