# 4. Classification

This Jupyter notebook is part of an exercise series titled *Classification*, based on the lecture of the same title.

This exercise series is divided into three parts. There will be one exercise session per part (= one part per week):

- **4.1.** [Decision Tree](./4.1.-Decision-Tree.ipynb) (*last week's notebook*)
- **4.2.** Naive Bayes (*this notebook*)
    - **4.2.1.** [Dataset](#4.2.1.-Dataset) 
    - **4.2.2.** [Train Your Categorical Naive Bayes](#4.2.2.-Train-Your-Categorical-Naive-Bayes)
        - **4.2.2.1.** [Prior Probability](#4.2.2.1.-Prior-Probability)
        - **4.2.2.2.** [Likelihood](#4.2.2.2.-Likelihood)
    - **4.2.3.** [Test Your Categorical Naive Bayes](#4.2.3.-Test-Your-Categorical-Naive-Bayes)  
- **4.3.** AdaBoost (*next week's notebook*) - *Will be uploaded at a later date as a separate ZIP file*

<div class="alert alert-block alert-warning">

**Important:**
    
Work on the respective part yourself **BEFORE** each exercise session. The exercise session is **NOT** intended for taking a first look at the exercise sheet, but for solving problems that came up while you prepared the exercise sheet beforehand.
    
</div>

**Importing Libraries**

Feel free to import more libraries here.

In [None]:
import pandas as pd
from typing import List, Any

## 4.2. (Categorical) Naive Bayes

Your task in this second exercise part will be to implement (categorical) naive Bayes from scratch. 

Recall Bayes' theorem from slides 45-48 and the example on slides 49-51 of our lecture.
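
For reference, the formulas can be written as follows. This is a sketch in common textbook notation; the symbols $C_i$ (a class) and $X = (x_1, \dots, x_d)$ (a tuple with $d$ attribute values) are assumptions and may differ from the notation on the slides.

$$P(C_i \mid X) = \frac{P(X \mid C_i)\,P(C_i)}{P(X)}$$

Naive Bayes assumes that the attributes are conditionally independent given the class, so the likelihood factorizes:

$$P(X \mid C_i) = \prod_{k=1}^{d} P(x_k \mid C_i)$$

Since $P(X)$ is the same for every class, the predicted class is:

$$\hat{C} = \arg\max_{C_i} \; P(C_i) \prod_{k=1}^{d} P(x_k \mid C_i)$$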

### 4.2.1. Dataset 
We will use the following dataset in this Jupyter notebook:

In [None]:
from datasets.buys_computer import train_buys_computer

# view dataset
train_buys_computer

<div class="alert alert-info" role="alert">

**Task 1:**
    
Implement Categorical Naive Bayes 
    
</div>    

**Your task is to implement the following methods of the `CategoricalNaiveBayes` class: `_calculate_prior_probability_per_class`, `_calculate_likelihood`, and `predict`.**

- `_calculate_prior_probability_per_class` calculates the probability that any given tuple belongs to a specific class. For instance, in the buys computer example dataset, the prior is the probability that any customer buys a PC.
- `_calculate_likelihood` calculates, for each class, the probability of observing a specific attribute value given that class. In the buys computer example, one such likelihood is the probability that a customer who buys a PC is a student.
- `predict` uses the likelihood and the prior probability per class to calculate a score proportional to the probability of each class label and returns the most probable class.
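
To make the difference between prior and likelihood concrete, here is a minimal sketch on a hypothetical toy dataset (the `toy` DataFrame and its column names are made up for illustration and are not the exercise dataset). It uses `value_counts` and `pd.crosstab` instead of the dictionary-based approach asked for in the task, so treat it as an illustration of the quantities rather than as the expected solution.

```python
import pandas as pd

# Hypothetical toy data, not the exercise dataset
toy = pd.DataFrame(
    {
        "student": ["yes", "no", "yes", "no", "yes"],
        "buys": ["yes", "no", "yes", "yes", "no"],
    }
)

# Prior: relative frequency of each class label in the whole dataset
prior = toy["buys"].value_counts(normalize=True).to_dict()

# Likelihood: relative frequency of each attribute value within each class.
# normalize="index" makes every row (one row per class) sum to 1.
likelihood = pd.crosstab(toy["buys"], toy["student"], normalize="index")

print(prior)
print(likelihood)
```

Here `likelihood.loc["yes", "yes"]` is the share of students among the tuples with class `"yes"`.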

We already implemented the `fit` function for you. 

<div class="alert alert-danger" role="alert">

**Note these additional requirements:**
- For the time being, we restrict our naive Bayes to work solely with **categorical values**.


</div>

In [None]:
class CategoricalNaiveBayes:
    def __init__(self) -> None:
        self.n: int = None
        # Function fit will later populate this variable
        self.prior: dict = None
        # Function fit will later populate this variable
        self.likelihood: dict = None
        # Function fit will later populate this variable
        self.target_attribute: str = None

    def fit(self, dataset: pd.DataFrame, target_attribute: str) -> None:
        """Fits naive Bayes for categorical data, that is: it calculates all necessary probabilities."""
        # Store number of tuples for later
        self.n = dataset.shape[0]
        # Store target attribute for later
        self.target_attribute = target_attribute
        # Calculate prior probability per class
        self._calculate_prior_probability_per_class(dataset=dataset)
        # Calculate likelihood
        self._calculate_likelihood(dataset=dataset)

    def _calculate_prior_probability_per_class(self, dataset: pd.DataFrame) -> None:
        """Private method to calculate prior probability for each class for a given dataset."""
        raise NotImplementedError(
            "Implement this function and store the result in self.prior."
        )

    def _calculate_likelihood(self, dataset: pd.DataFrame) -> None:
        """Private method to calculate likelihood for a given dataset."""
        raise NotImplementedError(
            "Implement this function and store the result in self.likelihood."
        )

    def predict(self, dataset: pd.DataFrame) -> List[Any]:
        """Returns predicted values for a given dataset."""
        raise NotImplementedError("Implement this function.")

In [None]:
from functools import reduce


class CategoricalNaiveBayes:
    def __init__(self) -> None:
        self.n: int = None
        # Function fit will later populate this variable
        self.prior: dict = None
        # Function fit will later populate this variable
        self.likelihood: dict = None
        # Function fit will later populate this variable
        self.target_attribute: str = None

    def fit(self, dataset: pd.DataFrame, target_attribute: str) -> None:
        """Fits naive Bayes for categorical data, that is: it calculates all necessary probabilities."""
        # Store number of tuples for later
        self.n = dataset.shape[0]
        # Store target attribute for later
        self.target_attribute = target_attribute
        # Calculate prior probability per class
        self._calculate_prior_probability_per_class(dataset=dataset)
        # Calculate likelihood
        self._calculate_likelihood(dataset=dataset)

    def _calculate_prior_probability_per_class(self, dataset: pd.DataFrame) -> None:
        """Private method to calculate prior probability for each class for a given dataset."""
        # Here, we build a dictionary that holds the class labels as keys
        # and the corresponding prior probability as values.
        self.prior = {
            class_label: dataset[dataset[self.target_attribute] == class_label].shape[0]
            / self.n
            for class_label in dataset[self.target_attribute].unique()
        }
        # Above dictionary comprehension could also be written as:
        # self.prior = dict()
        # for class_label in dataset[self.target_attribute].unique():
        #     current_n = dataset[dataset[self.target_attribute] == class_label].shape[0]
        #     self.prior[class_label] = current_n / self.n

    def _calculate_likelihood(self, dataset: pd.DataFrame) -> None:
        """Private method to calculate likelihood for a given dataset."""
        # Here we build a nested dictionary.
        # The outer most dictionary contains the class labels as keys.
        # For each class_label key, we store a dictionary for each attribute
        # (the attribute name is the key here).
        # Essentially, for each class label, attribute, and unique discrete value,
        # we store the likelihood.
        # Example of the resulting dictionary:
        # {
        #     "no": {
        #         "age": {"<=30": 0.6, "31-40": 0.0, ">40": 0.4},
        #         <other attributes here>
        #     },
        #     "yes": {
        #         "age": {
        #             "<=30": 0.2222222222222222,
        #             "31-40": 0.4444444444444444,
        #             ">40": 0.3333333333333333,
        #         },
        #         <other attributes here>
        #     },
        # }
        self.likelihood = {
            class_label: {
                attribute_name: {
                    attribute_value: dataset.loc[
                        (dataset[self.target_attribute] == class_label)
                        & (dataset[attribute_name] == attribute_value)
                    ].shape[0]
                    / dataset[dataset[self.target_attribute] == class_label].shape[0]
                    for attribute_value in dataset[attribute_name].unique()
                }
                for attribute_name in dataset.columns
                if attribute_name != self.target_attribute
            }
            for class_label in dataset[self.target_attribute].unique()
        }
        # The dictionary comprehension above could also be written as follows (uncomment the lines to try it):
        # self.likelihood = dict()
        # # Iterate over all class labels
        # for class_label in dataset[self.target_attribute].unique():
        #     # Create an empty dictionary
        #     current_class_label_dictionary = dict()
        #     # Calculate number of tuples in this class
        #     current_number_of_tuples_per_class = dataset[
        #         dataset[self.target_attribute] == class_label
        #     ].shape[0]

        #     # Iterate over all attribute names in our dataset.
        #     for attribute_name in dataset.columns:
        #         if attribute_name != self.target_attribute:
        #             # If current attribute name is not equal to the target attribute.
        #             # Create an empty dictionary
        #             current_unique_attribute_values = dict()
        #             # Iterate over all unique attribute values of the current attribute.
        #             for attribute_value in dataset[attribute_name].unique():
        #                 # Calculate the number of tuples that have the current class label and attribute value.
        #                 current_number_of_tuples = dataset.loc[
        #                     (dataset[self.target_attribute] == class_label)
        #                     & (dataset[attribute_name] == attribute_value)
        #                 ].shape[0]
        #                 # Calculate the likelihood and
        #                 # add this number to the dictionary with the attribute value as the key.
        #                 current_unique_attribute_values[attribute_value] = (
        #                     current_number_of_tuples
        #                     / current_number_of_tuples_per_class
        #                 )
        #             # Add the dictionary that contains the likelihoods for each unique attribute value
        #             # and current class label to the dictionary with the attribute name as key.
        #             current_class_label_dictionary[
        #                 attribute_name
        #             ] = current_unique_attribute_values
        #     # Add the dictionary that contains the likelihoods for each attribute name and
        #     # their corresponding unique discrete value's likelihood to another dictionary with the class label as key.
        #     self.likelihood[class_label] = current_class_label_dictionary

    def predict(self, dataset: pd.DataFrame) -> List[Any]:
        """Returns predicted values for a given dataset."""
        # Predicting works by multiplying, for each tuple and each class, the prior of the class
        # with the likelihood of every attribute value observed in that tuple.
        prediction = [
            {
                class_label: (
                    self.prior[class_label]
                    * reduce(
                        (lambda x, y: x * y),
                        [
                            attribute_likelihoods[row[attribute]]
                            for attribute, attribute_likelihoods in likelihoods.items()
                        ],
                    )
                )
                for class_label, likelihoods in self.likelihood.items()
            }
            for _, row in dataset.iterrows()
        ]
        # Above list comprehension could be written as:
        # prediction = []
        # # Iterate over each row in our test dataset
        # for _, row in dataset.iterrows():
        #     # Create an empty dictionary that later stores the class probabilities for this current tuple/row
        #     current_tuple_probabilities = dict()

        #     # Iterate over all class labels and corresponding likelihood dictionaries in our likelihood class variable
        #     for class_label, likelihoods in self.likelihood.items():
        #         current_prior = self.prior[class_label]

        #         # Create an empty list to store the likelihood for each attribute
        #         current_likelihoods = []

        #         # Alternatively to the list:
        #         # Creating a list forces us to multiply all elements, we could instead calculate the likelihood as we iterate
        #         # over the elements. Thus, we could initialize a variable with 1.
        #         current_likelihood = 1

        #         # Iterate over each attribute and its corresponding likelihood
        #         for attribute, attribute_likelihoods in likelihoods.items():
        #             # Add only the likelihood of the corresponding value we observe in our test dataset to our likelihood list
        #             current_likelihoods.append(attribute_likelihoods[row[attribute]])

        #             # Alternatively, calculate on the fly:
        #             current_likelihood *= attribute_likelihoods[row[attribute]]

        #         # When using a list to store likelihoods we need to multiply each element.
        #         # To achieve this we can use the function reduce with an anonymous function (lambda function) that multiplies two elements.
        #         # The result will be a single number.
        #         current_likelihood = reduce((lambda x, y: x * y), current_likelihoods)

        #         # Calculate the probability and add the result to our tuple probability dictionary
        #         current_tuple_probabilities[class_label] = current_prior * current_likelihood

        #     # Add the current tuple probabilities for each class to the list of predictions
        #     prediction.append(current_tuple_probabilities)

        # We now have a list that contains for each tuple a dictionary. Each dictionary holds the probabilities
        # for each class of the tuple at this list position. Yet we want to return only the class label for each
        # tuple in our test dataset.
        # Therefore, we build a list that stores exactly this and return this list.
        # Building this list can be done with a simple list comprehension.
        # For every dictionary in our prediction list, we want to select the dictionary key (our class label) that
        # holds the highest class probability. Selecting the dictionary key with the highest class probability can
        # be done by using the function max in combination with its parameter key.
        # This compares the keys by their associated values and returns the key that belongs to the maximum value.
        return [max(elem, key=elem.get) for elem in prediction]
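
The two building blocks used in `predict` can be tried in isolation. The following is a small sketch with made-up numbers (`likelihoods` and `class_probabilities` are hypothetical values, not results from the dataset). It also shows `math.prod`, available since Python 3.8, as a possible alternative to `reduce` with a lambda.

```python
from functools import reduce
from math import prod

# Hypothetical per-attribute likelihoods for one class
likelihoods = [0.5, 0.4, 0.2]

# Multiply all elements pairwise from left to right
product_via_reduce = reduce(lambda x, y: x * y, likelihoods)
# Same product using the standard library helper
product_via_prod = prod(likelihoods)

# Hypothetical class scores for one tuple
class_probabilities = {"no": 0.007, "yes": 0.028}
# max iterates over the keys and compares them by their dictionary values
best_label = max(class_probabilities, key=class_probabilities.get)

print(product_via_reduce, product_via_prod, best_label)
```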

### 4.2.2. Train Your Categorical Naive Bayes

In [None]:
target_attribute = "buys_computer"

naive_bayes = CategoricalNaiveBayes()
naive_bayes.fit(dataset=train_buys_computer, target_attribute=target_attribute)

#### 4.2.2.1. Prior Probability
Take a look at the prior probability:

In [None]:
print("prior:", naive_bayes.prior)

Let's assert that it contains the correct values:

In [None]:
assert naive_bayes.prior == {"no": 0.35714285714285715, "yes": 0.6428571428571429}
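
The assertion above compares floating-point numbers exactly. If your implementation computes the same quantity in a different order of operations, the last digits may differ. A tolerance-based comparison is one way around this; the helper `priors_match` below is a hypothetical name introduced for illustration, and `computed_prior` is a stand-in for `naive_bayes.prior`.

```python
import math

# Expected priors: 5 of 14 tuples are "no", 9 of 14 are "yes"
expected_prior = {"no": 5 / 14, "yes": 9 / 14}
# Stand-in for naive_bayes.prior, computed in a slightly different way
computed_prior = {"no": 1 - 9 / 14, "yes": 9 / 14}


def priors_match(actual: dict, expected: dict, tolerance: float = 1e-9) -> bool:
    """Return True if both dictionaries have the same keys and nearly equal values."""
    return actual.keys() == expected.keys() and all(
        math.isclose(actual[label], expected[label], rel_tol=tolerance)
        for label in expected
    )


print(priors_match(computed_prior, expected_prior))
```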

#### 4.2.2.2. Likelihood
Take a look at the likelihood:

In [None]:
print("likelihood:", naive_bayes.likelihood)

In [None]:
assert naive_bayes.likelihood == {
    "no": {
        "age": {"<=30": 0.6, "31-40": 0.0, ">40": 0.4},
        "income": {"high": 0.4, "medium": 0.4, "low": 0.2},
        "student": {"no": 0.8, "yes": 0.2},
        "credit_rating": {"fair": 0.4, "excellent": 0.6},
    },
    "yes": {
        "age": {
            "<=30": 0.2222222222222222,
            "31-40": 0.4444444444444444,
            ">40": 0.3333333333333333,
        },
        "income": {
            "high": 0.2222222222222222,
            "medium": 0.4444444444444444,
            "low": 0.3333333333333333,
        },
        "student": {"no": 0.3333333333333333, "yes": 0.6666666666666666},
        "credit_rating": {"fair": 0.6666666666666666, "excellent": 0.3333333333333333},
    },
}
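
With these values, a prediction can be worked out by hand. The sketch below does so for one hypothetical tuple (age `<=30`, income `medium`, student `yes`, credit rating `fair`; this tuple is an assumption chosen for illustration). The numbers are copied from the prior and likelihood dictionaries above.

```python
from math import prod

# Priors from the assertion in 4.2.2.1.
prior = {"no": 5 / 14, "yes": 9 / 14}

# Likelihoods of the observed attribute values, taken from the dictionary above:
# age "<=30", income "medium", student "yes", credit_rating "fair"
observed_likelihoods = {
    "no": [0.6, 0.4, 0.2, 0.4],
    "yes": [2 / 9, 4 / 9, 6 / 9, 6 / 9],
}

# Score per class: prior times the product of the attribute likelihoods
scores = {
    label: prior[label] * prod(values)
    for label, values in observed_likelihoods.items()
}
predicted = max(scores, key=scores.get)

print(scores)
print(predicted)
```

The score for `"yes"` (about 0.028) is larger than the score for `"no"` (about 0.007), so this tuple is classified as `"yes"`.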

### 4.2.3. Test Your Categorical Naive Bayes 

Let's import a test dataset:

In [None]:
from datasets.buys_computer import test_buys_computer


test_buys_computer

And test it:

In [None]:
print(
    "predicted:",
    naive_bayes.predict(test_buys_computer.iloc[:, :-1]),
    "true:",
    test_buys_computer.iloc[:, -1].tolist(),
)

<div class="alert alert-info" role="alert">
    
**Task 2:**
    
Train and Test Your Naive Bayes Implementation With the Play Tennis Dataset
    
</div>    

In [None]:
from datasets.play_tennis import train_play_tennis


train_play_tennis

Train your naive Bayes:

In [None]:
# train your naive Bayes here

In [None]:
# train your naive Bayes here
naive_bayes_tennis = CategoricalNaiveBayes()
naive_bayes_tennis.fit(dataset=train_play_tennis, target_attribute="Play Tennis")

Test your newly trained naive Bayes with the following test dataset:

In [None]:
from datasets.play_tennis import test_play_tennis


test_play_tennis

Make predictions with your naive Bayes:

In [None]:
# get predictions here

In [None]:
# get predictions here
print(
    "predicted:",
    naive_bayes_tennis.predict(test_play_tennis.iloc[:, :-1]),
    "true:",
    test_play_tennis.iloc[:, -1].tolist(),
)
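
Instead of comparing the two lists by eye, the share of correct predictions can be computed. This is a minimal sketch; `accuracy` is a hypothetical helper that is not part of the exercise, and the example labels are made up.

```python
from typing import Any, Sequence


def accuracy(predicted: Sequence[Any], true: Sequence[Any]) -> float:
    """Return the fraction of positions where predicted and true labels agree."""
    if len(predicted) != len(true):
        raise ValueError("predicted and true must have the same length")
    if len(true) == 0:
        raise ValueError("cannot compute accuracy of an empty sequence")
    return sum(p == t for p, t in zip(predicted, true)) / len(true)


# Made-up labels for illustration
example_accuracy = accuracy(["yes", "no", "yes", "yes"], ["yes", "no", "no", "yes"])
print(example_accuracy)
```

With the objects from this notebook, a call could look like `accuracy(naive_bayes_tennis.predict(test_play_tennis.iloc[:, :-1]), test_play_tennis.iloc[:, -1].tolist())`.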