In [1]:
# Enable code formatting using external plugin: nb_black.
%reload_ext nb_black


# Naive Bayes Classifier

## Index

1. [Setup](#[1]-Setup)
2. [Data Analysis](#[2]-Data-Analysis)
3. [Implement Naive Bayes](#[3]-Implement-Naive-Bayes)
4. [Training Model](#[4]-Training-Model)
5. [Testing Model](#[5]-Testing-Model)
6. [Testing Laplace Smoothing (Additive Smoothing)](#[6]-Testing-Additive-Smoothing)

## Objective

1. Implement a simple **Naive Bayes classifier for categorical features**.
2. **Train and test the model** on a sample weather forecast dataset.
3. Use the model to predict whether it is ideal to **play Tennis outdoors**.

<a id="[1]-Setup"></a>
# [1] Setup

### Import and configure required libraries

In [2]:
# Data manipulation libraries
import pandas as pd

# Data visualization libraries
import prettytable
from prettytable import PrettyTable

# General imports
import math

# Library versions used in below EDA.
print("Pandas version:", pd.__version__)
print("PrettyTable version:", prettytable.__version__)

# Configure Pandas.
# Limit output width to 130 characters; wider output wraps onto the next line.
pd.options.display.width = 130

Pandas version: 1.4.2
PrettyTable version: 3.3.0



#### Common functions

In [3]:
def is_empty(element) -> bool:
    """
    Function to check if input `element` is empty.

    Other than some special exclusions and inclusions,
    this function returns boolean result of Falsy check.
    """
    if isinstance(element, (int, float)) and element == 0:
        # Exclude 0 and 0.0 from the Falsy set.
        return False
    elif isinstance(element, str) and len(element.strip()) == 0:
        # Include string with only one or more empty space(s) into Falsy set.
        return True
    elif isinstance(element, bool):
        # Exclude False from the Falsy set.
        return False
    else:
        # Falsy check.
        return not element


def get_count(items, get_key=lambda item: item):
    """
    Function to count occurrences of each key in a list of items.

    `get_key` extracts the key to count from an item.
    """
    count = {}
    for index, item in enumerate(items):
        key = get_key(item)
        if is_empty(key):
            raise ValueError(f"Specified key not found in the item at index: {index} in the list.")

        count[key] = count.get(key, 0) + 1

    # If `True` is among the keys, return only its count; otherwise return the full dictionary.
    return count.get(True, count)


ENABLE_LOG = False


def text(*args):
    """
    Function to print() input string when logging is enabled.
    """
    if ENABLE_LOG is True:
        print(*args)


def title(title_str, padding=(1, 1), line_style="="):
    """
    Function to print() input string with some styles, when logging is enabled.
    """
    if ENABLE_LOG is True:
        pad_top, pad_bot = padding
        pt = "\n" * pad_top
        pb = "\n" * pad_bot
        print(pt + title_str + "\n" + line_style * len(title_str) + pb)


#### Load data-points from the `.csv` file

In [4]:
wthr_df = pd.read_csv("../input/simple-weather-forecast/weather_forecast.csv")
wthr_df.head(14)

Unnamed: 0,Outlook,Temperature,Humidity,Windy,Play
0,Sunny,Hot,High,Weak,No
1,Sunny,Hot,High,Strong,No
2,Overcast,Hot,High,Weak,Yes
3,Rain,Mild,High,Weak,Yes
4,Rain,Cool,Normal,Weak,Yes
5,Rain,Cool,Normal,Strong,No
6,Overcast,Cool,Normal,Strong,Yes
7,Sunny,Mild,High,Weak,No
8,Sunny,Cool,Normal,Weak,Yes
9,Rain,Mild,Normal,Weak,Yes



<a id="[2]-Data-Analysis"></a>
# [2] Data Analysis

In [5]:
rows, cols = wthr_df.shape
print("Rows:", rows, "Columns:", cols)

Rows: 14 Columns: 5



`DataFrame` metadata:

In [6]:
wthr_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14 entries, 0 to 13
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Outlook      14 non-null     object
 1   Temperature  14 non-null     object
 2   Humidity     14 non-null     object
 3   Windy        14 non-null     object
 4   Play         14 non-null     object
dtypes: object(5)
memory usage: 688.0+ bytes



**Observations**

1. The dataset contains **14 rows and 5 columns**.
2. All five columns have `non-null` values in all 14 rows. **No missing data**.
3. All columns contain data of **string data-type**.
4. All columns are **categorical features**.

#### Distinct values for each feature

In [7]:
table = PrettyTable(["Features", "Distinct Values", "Count"], align="l")

for feature in wthr_df.columns[0:-1]:
    unq_vals = wthr_df[feature].unique().tolist()
    table.add_row([feature, ", ".join(unq_vals), len(unq_vals)])

print(table)

+-------------+-----------------------+-------+
| Features    | Distinct Values       | Count |
+-------------+-----------------------+-------+
| Outlook     | Sunny, Overcast, Rain | 3     |
| Temperature | Hot, Mild, Cool       | 3     |
| Humidity    | High, Normal          | 2     |
| Windy       | Weak, Strong          | 2     |
+-------------+-----------------------+-------+



#### Class label

In [8]:
wthr_df["Play"].value_counts()

Yes    9
No     5
Name: Play, dtype: int64


**Observations**

1. The dataset is **slightly imbalanced**.
2. The dataset contains **64% positive data-points**, i.e., `Play == "Yes"`. Positive data-points are the **Majority class**.
3. The dataset contains **36% negative data-points**, i.e., `Play == "No"`. Negative data-points are the **Minority class**.
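
The percentages above follow directly from the class counts. A minimal sketch using only the standard library; the `play` list is a hand-typed copy of the `Play` column (an assumption made so the snippet runs without the CSV file):

```python
from collections import Counter

# Hand-typed copy of the `Play` column (assumption: stands in for wthr_df["Play"]).
play = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes", "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"]

counts = Counter(play)
total = sum(counts.values())

# Share of each class as a whole-number percentage.
shares = {label: round(100 * count / total) for label, count in counts.items()}
print(shares)
```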

### How does an imbalanced dataset affect Naive Bayes?

1. The Minority class starts with a **smaller Prior** value.
2. **Laplace Smoothing** gives **undue attention** to the Minority class: the same `alpha` is added to smaller counts, so it shifts the Minority class likelihoods more.
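
To see the second effect, compare the smoothed likelihood of a feature value that never co-occurs with either class. A minimal sketch, assuming `alpha = 1`, `k = 3` categories and the original class counts (5 "No", 9 "Yes"); the helper name `smoothed` is hypothetical and not part of this notebook:

```python
def smoothed(count, class_total, k, alpha=1):
    """Laplace-smoothed estimate of P(feature value | class)."""
    return (count + alpha) / (class_total + alpha * k)


# A feature value seen zero times in both classes of the original dataset.
p_given_no = smoothed(0, 5, k=3)  # Minority class: 1 / 8
p_given_yes = smoothed(0, 9, k=3)  # Majority class: 1 / 12

print(p_given_no, p_given_yes)
```

The same zero count receives a larger probability under the Minority class, which is the bias that balancing the classes removes.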

#### Solution

The solution to the above problem is either **Up-sampling** the Minority class or **Down-sampling** the Majority class. Here Up-sampling is the better choice: the dataset has only 14 rows, and Down-sampling would discard some of them.

### Up-sampling the Minority class

Up-sample the dataset by duplicating the last four Minority class rows (`Play == "No"`), which raises the "No" count from 5 to 9.

In [9]:
# Filter to fetch all minority class values.
fltr = wthr_df["Play"] == "No"

# Apply filter and slice out last four rows.
last_four_rows = wthr_df.loc[fltr][-4:]

# Append the four duplicated Minority class rows to the original DataFrame, producing a new DataFrame.
wthr_dfb = pd.concat([wthr_df, last_four_rows], ignore_index=True)

# Sort by `Play` column to check the concatenation.
wthr_dfb.sort_values(by=["Play"], ascending=False)

Unnamed: 0,Outlook,Temperature,Humidity,Windy,Play
9,Rain,Mild,Normal,Weak,Yes
2,Overcast,Hot,High,Weak,Yes
3,Rain,Mild,High,Weak,Yes
4,Rain,Cool,Normal,Weak,Yes
6,Overcast,Cool,Normal,Strong,Yes
8,Sunny,Cool,Normal,Weak,Yes
10,Sunny,Mild,Normal,Strong,Yes
11,Overcast,Mild,High,Strong,Yes
12,Overcast,Hot,Normal,Weak,Yes
13,Rain,Mild,High,Strong,No



In [10]:
# Verify if dataset is balanced.
wthr_dfb["Play"].value_counts()

No     9
Yes    9
Name: Play, dtype: int64


Now the dataset is balanced!

<a id="[3]-Implement-Naive-Bayes"></a>
# [3] Implement Naive Bayes

Implement Naive Bayes with the following features:

1. Integrate Laplace Smoothing.
2. Avoid numerical underflow.
3. Show Feature Importance & Model Interpretability.
4. Log model training.
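
Avoiding numerical underflow matters because a product of many probabilities can round to zero in floating point, while the sum of their logarithms stays finite. A minimal sketch with made-up probabilities (200 factors of 0.01, chosen only to force the underflow, not taken from the dataset):

```python
import math

# Made-up likelihoods: 200 factors of 0.01 (illustrative only).
probs = [0.01] * 200

product = math.prod(probs)  # Underflows to 0.0; the true value is 1e-400.
log_sum = sum(math.log(p) for p in probs)  # Stays finite: about 200 * log(0.01).

print(product, log_sum)
```

Comparing classes by `log_sum` gives the same ranking as comparing the products would, because the logarithm is monotonic.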

In [11]:
def NaiveBayes(df):
    """
    Closure to implement simple Naive Bayes Classifier.
    """
    N, cols = df.shape  # N: Number of data-points, columns

    # Split features and class-label.
    f_names = df.columns[0:-1]
    l_name = df.columns[-1]
    labels = df[l_name].value_counts().to_dict()

    # Calculate class-label Prior.
    cl_p = {label: count / N for label, count in labels.items()}

    def fit(K, alpha=1):
        """
        Train model using every data-point based on Naive Bayes technique.
        """

        def laplace_smoothing(a, b, k, sep=False):
            """
            Function to compute Laplace Smoothing.
            """
            return ((a + alpha), (b + (alpha * k))) if sep else ((a + alpha) / (b + (alpha * k)))

        # Print class-label Prior value.
        title("Prior:")
        for label, count in labels.items():
            text(f"P({label}) = {count}/{N}")

        # Build and print Likelihood table.
        title("Likelihood with Laplace Smoothing:", [2, 0])
        model = {}
        for f_idx, f_name in enumerate(f_names):
            # For each feature calculate conditional probability.
            title(f'Feature: "{f_name}" & label: {list(labels.keys())}', [1, 0], "-")

            vals = df[[f_name, l_name]].values
            f_count = get_count(vals, lambda item: item[0] + item[1])

            # Calculate probability of feature = `feature` | class-label == `label`.
            for feature in df[f_name].unique().tolist():
                for label, count in labels.items():
                    aib = f_count.get(feature + label, 0)
                    nmtr, dnomtr = laplace_smoothing(aib, count, K[f_idx], sep=True)
                    model[(f_name, feature, label)] = nmtr / dnomtr
                    text(f"P({f_name} = {feature} | {l_name} == {label}) = {nmtr}/{dnomtr}")

        text("\nModel training complete!")

        def sort_order(item):
            """
            Sort model by label and probability in descending order.
            """
            (f_name, f_value, label), prob = item
            return (label, prob)

        # Sort model to observe Feature Importance & Interpretability.
        s_model = sorted(model.items(), key=sort_order, reverse=True)  # Sorted model.

        title("Feature Importance & Interpretability", [2, 0])
        table = PrettyTable(["Sl. No.", "Ft Name", "Ft Value", "Label", "Prob"], align="r")
        for idx, l in enumerate(s_model):
            (f_name, f_val, label), prob = l
            table.add_row([idx + 1, f_name, f_val, label, round(prob, 4)])
        text(table)

        def predict(x_qs):
            """
            Function to compute Posterior using Prior and Likelihood.
            """
            sigma_cl = {}  # Calculate sigma per class-label.
            for label, count in labels.items():
                sigma = math.log(cl_p[label])  # Initialize `sigma` with Prior.
                for idx, x_q in enumerate(x_qs):
                    # If model receives an unseen feature value then use Laplace Smoothing.
                    cp = model.get((f_names[idx], x_q, label), laplace_smoothing(0, count, K[idx]))

                    # Log probabilities for numerical stability.
                    sigma += math.log(cp)

                sigma_cl[label] = round(sigma, 4)
                text(f"P({l_name} = {label} | {x_qs}) = {sigma_cl[label]}")

            # Compare log-prob and return largest value as `y_q`.
            return max(sigma_cl, key=lambda item: sigma_cl[item])

        return predict

    return fit


> Note: 
> The above implementation is verified against the [`CategoricalNB`][1] class of the **scikit-learn** library. 
> Check the [verification code][2] on GitHub.

[1]: https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.CategoricalNB.html#sklearn-naive-bayes-categoricalnb
[2]: https://github.com/DheemanthBhat/ML-Concepts/blob/4fc1e9c33f4ddb8abfa381dd97bf9835aa9cedd6/4.%20Naive%20Bayes/Naive%20Bayes%20Classifier%20-%20CategoricalNB.ipynb

<a id="[4]-Training-Model"></a>
# [4] Training Model

#### Initialize classifier

In [12]:
fit = NaiveBayes(wthr_dfb)


Train model using training dataset.

> Note: Set `ENABLE_LOG` to `False` to train model silently (without logs).

#### Number of available categories for each feature

| Features    | Categories            | K |
|-------------|-----------------------|---|
| Outlook     | Sunny, Overcast, Rain | 3 |
| Temperature | Hot, Mild, Cool       | 3 |
| Humidity    | Low, Normal, High     | 3 |
| Windy       | Weak, Strong          | 2 |
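
`K` counts every category a feature can take, including ones absent from the training data (`Low` for `Humidity`). With that `K`, the smoothed likelihoods of all categories sum to 1 within a class. A quick check for `Humidity` given `Play == "No"`, assuming `alpha = 1` and the up-sampled counts (7 `High`, 2 `Normal`, 0 `Low` out of 9 "No" rows):

```python
from fractions import Fraction

alpha, k, n_no = 1, 3, 9  # Smoothing strength, Humidity categories, "No" rows.
counts = {"High": 7, "Normal": 2, "Low": 0}  # Assumed Humidity counts among "No" rows.

# Laplace smoothing: (count + alpha) / (class total + alpha * K).
likelihood = {value: Fraction(c + alpha, n_no + alpha * k) for value, c in counts.items()}

print(likelihood)
print(sum(likelihood.values()))
```

Using `K = 2` here would leave no probability mass for `Low`, and the three values would sum to more than 1.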

In [13]:
ENABLE_LOG = True
predict = fit(K=[3, 3, 3, 2])


Prior:

P(No) = 9/18
P(Yes) = 9/18


Likelihood with Laplace Smoothing:

Feature: "Outlook" & label: ['No', 'Yes']
-----------------------------------------
P(Outlook = Sunny | Play == No) = 6/12
P(Outlook = Sunny | Play == Yes) = 3/12
P(Outlook = Overcast | Play == No) = 1/12
P(Outlook = Overcast | Play == Yes) = 5/12
P(Outlook = Rain | Play == No) = 5/12
P(Outlook = Rain | Play == Yes) = 4/12

Feature: "Temperature" & label: ['No', 'Yes']
---------------------------------------------
P(Temperature = Hot | Play == No) = 4/12
P(Temperature = Hot | Play == Yes) = 3/12
P(Temperature = Mild | Play == No) = 5/12
P(Temperature = Mild | Play == Yes) = 5/12
P(Temperature = Cool | Play == No) = 3/12
P(Temperature = Cool | Play == Yes) = 4/12

Feature: "Humidity" & label: ['No', 'Yes']
------------------------------------------
P(Humidity = High | Play == No) = 8/12
P(Humidity = High | Play == Yes) = 4/12
P(Humidity = Normal | Play == No) = 3/12
P(Humidity = Normal | Play == Yes) = 7/12

Featu


---

**Observations**

Features _Windy_, _Humidity_ and _Temperature_, in that order, are the important features for predicting whether the weather is ideal to play Tennis outdoors.

The model has learnt the following from the likelihood probabilities:

1. Feature _Windy_ has high probability values as seen in:
    * **Row 1**: When wind is _Weak_ we can play outdoors.
    * **Row 12**: When wind is _Strong_ we cannot play outdoors.
2. Feature _Humidity_ has high probability values as seen in:
    * **Row 2**: When humidity is _Normal_ we can play outdoors.
    * **Row 11**: When humidity is _High_ we cannot play outdoors.
3. Feature _Temperature_ has high probability values as seen in:
    * **Row 7**: When temperature is _Cool_ we can play outdoors.
    * **Row 17**: When temperature is _Hot_ we cannot play outdoors.
    * **Row 4 & 15**: When temperature is _Mild_, it depends on other features to decide if we can play outdoors.
4. Feature _Outlook_ has the least importance as seen in:
    * **Row 6**: _Outlook_ is _Rain_ but we can play outdoors (label says "Yes").
    * **Row 13**: _Outlook_ is _Sunny_ but we cannot play outdoors (label says "No").

<a id="[5]-Testing-Model"></a>
# [5] Testing Model

In [14]:
ENABLE_LOG = False

# Features: Outlook, Temperature, Humidity, Windy.
x_q = ["Sunny", "Cool", "High", "Strong"]  # Query point
y_q = predict(x_q)

print("Query point:", x_q)
print("Output:", y_q)

Query point: ['Sunny', 'Cool', 'High', 'Strong']
Output: No



<a id="[6]-Testing-Additive-Smoothing"></a>
# [6] Testing Additive Smoothing

Feature `Humidity` contains only `[High, Normal]` values in the training dataset. Pass a new value `Low`, not seen during training, to check Laplace Smoothing.

In [15]:
ENABLE_LOG = True

# Features: Outlook, Temperature, Humidity, Windy.
x_q = ["Sunny", "Cool", "Low", "Strong"]  # Query point
y_q = predict(x_q)

print("Query point:", x_q)
print("Output:", y_q)

P(Play = No | ['Sunny', 'Cool', 'Low', 'Strong']) = -5.7095
P(Play = Yes | ['Sunny', 'Cool', 'Low', 'Strong']) = -6.6746
Query point: ['Sunny', 'Cool', 'Low', 'Strong']
Output: No



It is _sunny_, the temperature is _cool_ and the humidity is _low_, but the wind is _strong_, hence we cannot play outdoors.
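
The two logged scores can be reproduced by hand. In the sketch below, the `Outlook` and `Temperature` likelihoods are copied from the training log, the unseen `Humidity = Low` falls back to `(0 + 1) / (9 + 1 * 3)`, and the `Windy` likelihoods are an assumption: they are recomputed from the up-sampled rows with `K = 2`, since that part of the log is cut off.

```python
import math

# Smoothed likelihoods for the query ["Sunny", "Cool", "Low", "Strong"],
# in feature order: Outlook, Temperature, Humidity, Windy.
# The Windy values (7/11 and 4/11) are recomputed by hand, not read from the log.
likelihoods = {
    "No": [6 / 12, 3 / 12, 1 / 12, 7 / 11],
    "Yes": [3 / 12, 4 / 12, 1 / 12, 4 / 11],
}
prior = {"No": 9 / 18, "Yes": 9 / 18}

# Log-score per label: log Prior plus the sum of log Likelihoods.
log_scores = {
    label: round(math.log(prior[label]) + sum(math.log(p) for p in probs), 4)
    for label, probs in likelihoods.items()
}
prediction = max(log_scores, key=log_scores.get)
print(log_scores, prediction)
```

The unseen value contributes the same factor `1/12` to both labels, so the decision is driven by the remaining three features.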

Now if the wind is _weak_, it is an ideal situation to play outdoors, so let's check the model prediction when the weather is _"Sunny", "Cool", "Low", "Weak"_:

In [16]:
# Features: Outlook, Temperature, Humidity, Windy.
x_q = ["Sunny", "Cool", "Low", "Weak"]  # Query point
y_q = predict(x_q)

print("Query point:", x_q)
print("Output:", y_q)

P(Play = No | ['Sunny', 'Cool', 'Low', 'Weak']) = -6.2691
P(Play = Yes | ['Sunny', 'Cool', 'Low', 'Weak']) = -6.1149
Query point: ['Sunny', 'Cool', 'Low', 'Weak']
Output: Yes



Indeed, it is an ideal weather condition to play outdoors!
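
The values logged as `P(Play = ... | ...)` are log-scores (log Prior plus log Likelihoods), not probabilities, which is why they are negative. A minimal sketch of turning them into normalized posteriors, using the two rounded values from the output above and the log-sum-exp trick (an addition for illustration, not part of the classifier):

```python
import math

# Log-scores logged for the query ["Sunny", "Cool", "Low", "Weak"].
log_scores = {"No": -6.2691, "Yes": -6.1149}

# Log-sum-exp: subtract the maximum before exponentiating for stability.
top = max(log_scores.values())
log_norm = top + math.log(sum(math.exp(s - top) for s in log_scores.values()))

# Normalized posterior per label; the values sum to 1.
posterior = {label: math.exp(s - log_norm) for label, s in log_scores.items()}
print(posterior)
```

Under these values the posterior for `Yes` comes out at roughly 0.54, so the prediction is a narrow one.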