# 4. Classification

This Jupyter notebook is part of the exercise series *Classification*, which is based on the lecture of the same title.

This exercise series is divided into three parts. There will be one exercise session per part (= one part per week):

- **4.1.** Decision Tree (*notebook of the week before last*)
- **4.2.** Naive Bayes (*last week's notebook*)
- **4.3.** AdaBoost (*this notebook*)
    - **4.3.1.** [Dataset](#4.3.1.-Dataset)
    - **4.3.2.** [Instantiate and Train a AdaBoost Model](#4.3.2.-Instantiate-and-Train-a-AdaBoost-Model)
    - **4.3.3.** [Obtain Predictions with Your AdaBoost Model](#4.3.3.-Obtain-Predictions-with-Your-AdaBoost-Model)
    - **4.3.4.** [Evaluate Your AdaBoost Model](#4.3.4.-Evaluate-Your-AdaBoost-Model)
        - **4.3.4.1.** [Import `scikit-learn`'s Metrics Module](#4.3.4.1.-Import-scikit-learn's-Metrics-Module)
        - **4.3.4.2.** [Calculate the Evaluation Metrics](#4.3.4.2.-Calculate-the-Evaluation-Metrics)

<div class="alert alert-block alert-warning">

**Important:**
    
Work on the respective part yourself **BEFORE** each exercise session. The exercise session is **NOT** intended to take a first look at the exercise sheet, but to solve problems students had while preparing the exercise sheet beforehand.
    
</div>

**Importing Libraries**

Feel free to import more libraries here.

In [None]:
import pandas as pd
from typing import List, Any
from math import log
import random

from classifier import DecisionTree

## 4.3. AdaBoost
Your task in this exercise is to implement AdaBoost from scratch. As the weak learner, it uses a slightly modified version of the previously implemented decision tree: this tree selects its splitting attribute at random and is only grown to a depth of three.

### 4.3.1. Dataset 
We will use the following dataset in this Jupyter notebook:

In [None]:
dataset = pd.read_csv("car_train.csv")

# view dataset
dataset.head()

This dataset is a slightly modified version of the [car evaluation dataset](https://archive.ics.uci.edu/ml/datasets/Car+Evaluation) from the UCI Machine Learning Repository. The original dataset has four class values. For this exercise, we reduced it to a binary classification problem:

In [None]:
dataset["condition"].unique()

<div class="alert alert-info" role="alert">

**Task 1:**
    
Implement AdaBoost

</div>

**Your task is to implement the following functions of the `AdaBoost` class below: `fit`, `_missclassification_error`, `_calculate_new_weights`, `_normalize_weights`, and `_classifiy_single_tuple`.**

- `fit` already comes with some lines of code. Your task here is to implement the missing core functionality that trains multiple weak classifiers.
- `_missclassification_error` should return 1 if the true value and the predicted value differ and 0 if they are the same.
- `_calculate_new_weights` calculates the weights for the next iteration.
- `_normalize_weights` normalizes the weights by dividing each weight by the sum of all weights.
- `_classifiy_single_tuple` returns, as the name suggests, a class label for a single tuple. It is used in the function `predict`.
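
To make the weight formulas concrete, here is a minimal hand-traced sketch of a single boosting round. The four tuples and the error pattern are made up for illustration; only the formulas (weighted error, beta, update with beta to the power of 1 - error, normalization) come from the docstrings of this exercise.

```python
# Hypothetical toy round: four tuples, the weak learner misclassifies the second one.
weights = [0.25, 0.25, 0.25, 0.25]
errors = [0, 1, 0, 0]  # 1 = misclassified, 0 = correctly classified

# Weighted error of the weak learner
round_error = sum(w * e for w, e in zip(weights, errors))

# Beta is small when the weak learner is accurate
beta = round_error / (1 - round_error)

# Correctly classified tuples are scaled down by beta,
# misclassified tuples keep their weight (beta ** 0 == 1)
new_weights = [w * beta ** (1 - e) for w, e in zip(weights, errors)]

# Normalize so that the weights sum to 1 again
total = sum(new_weights)
normalized = [w / total for w in new_weights]

print("error:", round_error, "beta:", beta)
print("normalized weights:", normalized)
```

After normalization, the single misclassified tuple carries half of the total weight, so it is much more likely to be drawn when the next weak learner is trained.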


**Hint:**

Sampling a dataset with replacement can be done with the function `sample` from pandas. Make sure to set the parameter `replace=True`; otherwise, each row can be drawn at most once. Additionally, set `axis=0` because we want to sample rows, not columns.

You can take a look at the documentation of `sample` [here](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sample.html).
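
As a small illustration of weighted sampling with replacement, the sketch below draws from a made-up four-row frame. The column values and the weights are assumptions chosen so that the effect is easy to see; they are not taken from the car dataset.

```python
import pandas as pd

# Hypothetical toy frame; the values are made up for illustration.
toy = pd.DataFrame(
    {
        "doors": [2, 3, 4, 5],
        "condition": ["acc", "unacc", "acc", "unacc"],
    }
)

# One weight per row; rows with weight 0 can never be drawn.
toy_weights = [0.0, 0.5, 0.5, 0.0]

# Draw as many rows as the frame has, with replacement, row-wise (axis=0).
sampled = toy.sample(
    n=len(toy),
    weights=toy_weights,
    replace=True,
    random_state=0,
    axis=0,
)

print(sampled)
```

The sample has the same number of rows as the original frame, but it only contains rows with a non-zero weight, and these rows may appear several times.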

In [None]:
class AdaBoost:
    def __init__(self, n_classifier: int = 5, seed: int = 42) -> None:
        random.seed(seed)

        self.n_classifier: int = n_classifier

        # Function fit will later populate these variables
        self.target_attribute: str = None  # target attribute of training dataset
        self.classes: List[
            Any
        ] = []  # list of class labels, used later in function _classifiy_single_tuple()
        self.weights: List[List[float]] = []  # weights of each trained classifier
        self.betas: List[float] = []  # beta values of each trained classifier
        self.classifiers: List[DecisionTree] = []  # list of trained classifiers

    def fit(self, dataset: pd.DataFrame, target_attribute: str) -> None:
        """Fit AdaBoost by training multiple weak learners to the given dataset."""
        # Assign target_attribute to object variable
        self.target_attribute = target_attribute

        # Initialize weights
        self._initialize_weights(dataset=dataset)

        # Get and store the list of class labels in an object variable
        self.classes = dataset[target_attribute].unique().tolist()

        # Get number of tuples in dataset
        n_tuples = dataset.shape[0]

        while True:
            # Stop while-loop when there are enough classifiers
            if len(self.classifiers) == self.n_classifier:
                break

            raise NotImplementedError("Implement this function.")

    def _initialize_weights(self, dataset: pd.DataFrame) -> None:
        """Initialize weights if they have not been initialized before.
        Formula: weights = 1 / number of tuples in data"""
        if not self.weights:
            n_tuples = dataset.shape[0]
            self.weights.append([1 / n_tuples for _ in range(n_tuples)])

    def _weak_learn(self, dataset: pd.DataFrame) -> Any:
        """Fit a weak learner and return this classifier."""
        # Instantiate weak classifier
        classifier = DecisionTree()

        # Train weak classifier
        classifier.fit(dataset=dataset, target_attribute=self.target_attribute)

        return classifier

    @staticmethod
    def _missclassification_error(true: Any, predicted: Any) -> int:
        """Calculate the missclassification error.
        Returns 1 if missclassified (true != predicted), 0 if correct (true == predicted).
        """
        raise NotImplementedError("Implement this function.")

    @staticmethod
    def _calculate_new_weights(
        weights: List[float], beta: float, error: List[float]
    ) -> List[float]:
        """Update weights by multiplying weights with beta to the power of 1 - error."""
        raise NotImplementedError("Implement this function.")

    @staticmethod
    def _normalize_weights(weights: List[float]) -> List[float]:
        """Normalize weights. Formula: weights = weights / sum(weights)"""
        raise NotImplementedError("Implement this function.")

    def predict(self, dataset: pd.DataFrame) -> List[Any]:
        """Return prediction for a given dataset."""
        return [
            self._classifiy_single_tuple(tuple.to_frame().T)
            for _, tuple in dataset.iterrows()
        ]

    def _classifiy_single_tuple(self, tuple: pd.DataFrame) -> Any:
        """Classifies a single tuple."""
        raise NotImplementedError("Implement this function.")

In [None]:
class AdaBoost:
    def __init__(self, n_classifier: int = 5, seed: int = 42) -> None:
        random.seed(seed)

        self.n_classifier: int = n_classifier

        # Function fit will later populate these variables
        self.target_attribute: str = None  # target attribute of training dataset
        self.classes: List[
            Any
        ] = []  # list of class labels, used later in function predict()
        self.weights: List[List[float]] = []  # weights of each trained classifier
        self.betas: List[float] = []  # beta values of each trained classifier
        self.classifiers: List[DecisionTree] = []  # list of trained classifiers

    def fit(self, dataset: pd.DataFrame, target_attribute: str) -> None:
        """Fit AdaBoost by training multiple weak learners to the given dataset."""
        # Assign target_attribute to object variable
        self.target_attribute = target_attribute

        # Initialize weights
        self._initialize_weights(dataset=dataset)

        # Get and store the list of class labels in an object variable
        self.classes = dataset[target_attribute].unique().tolist()

        # Get number of tuples in dataset
        n_tuples = dataset.shape[0]

        while True:
            # Stop while-loop when there are enough classifiers
            if len(self.classifiers) == self.n_classifier:
                break

            # Normalize weights
            weights_normalized = self._normalize_weights(weights=self.weights[-1])

            # Sample dataset with replacement based on the distribution of variable weights_normalized
            current_dataset = dataset.sample(
                n=n_tuples,
                weights=weights_normalized,
                replace=True,
                random_state=42,
                axis=0,
            )

            # Call weak_learn to obtain the current trained classifier
            current_classifier = self._weak_learn(dataset=current_dataset)

            # Obtain predictions for the original dataset so that the errors
            # line up index-by-index with the tuple weights
            current_predictions = current_classifier.predict(dataset=dataset)

            # Determine missclassification error on every element
            missclassification_errors = [
                self._missclassification_error(true_value, predicted)
                for true_value, predicted in zip(
                    dataset[target_attribute],
                    current_predictions,
                )
            ]

            # Calculate overall error
            current_error = sum(
                [
                    w_i * err_i
                    for w_i, err_i in zip(weights_normalized, missclassification_errors)
                ]
            )

            # If the error is greater than 0.5, discard this classifier and try again
            if current_error > 0.5:
                continue

            # Calculate beta
            current_beta = current_error / (1 - current_error)

            # Append current beta to the list of betas
            self.betas.append(current_beta)

            # Calculate weights for next iteration (they are normalized at the start of the next pass)
            new_weights = self._calculate_new_weights(
                weights=self.weights[-1],
                beta=current_beta,
                error=missclassification_errors,
            )

            # Append current weights to the list of weights
            self.weights.append(new_weights)

            # Append current weak classifier to the list of classifiers
            self.classifiers.append(current_classifier)

    def _initialize_weights(self, dataset: pd.DataFrame) -> None:
        """Initialize weights if they have not been initialized before.
        Formula: weights = 1 / number of tuples in data"""
        if not self.weights:
            n_tuples = dataset.shape[0]
            self.weights.append([1 / n_tuples for _ in range(n_tuples)])

    def _weak_learn(self, dataset: pd.DataFrame) -> Any:
        """Fit a weak learner and return this classifier."""
        # Instantiate weak classifier
        classifier = DecisionTree()

        # Train weak classifier
        classifier.fit(dataset=dataset, target_attribute=self.target_attribute)

        return classifier

    @staticmethod
    def _missclassification_error(true: Any, predicted: Any) -> int:
        """Calculate the missclassification error.
        Returns 1 if missclassified (true != predicted), 0 if correct (true == predicted).
        """
        missclassified = 1 - int(true == predicted)
        return missclassified

    @staticmethod
    def _calculate_new_weights(
        weights: List[float], beta: float, error: List[float]
    ) -> List[float]:
        """Update weights by multiplying weights with beta to the power of 1 - error."""
        return [w_i * beta ** (1 - err_i) for w_i, err_i in zip(weights, error)]

    @staticmethod
    def _normalize_weights(weights: List[float]) -> List[float]:
        """Normalize weights. Formula: weights = weights / sum(weights)"""
        weight_sum = sum(weights)
        if weight_sum == 0:
            return [0 for _ in weights]
        return [w_i / weight_sum for w_i in weights]

    def predict(self, dataset: pd.DataFrame) -> List[Any]:
        """Return prediction for a given dataset."""
        return [
            self._classifiy_single_tuple(tuple.to_frame().T)
            for _, tuple in dataset.iterrows()
        ]

    def _classifiy_single_tuple(self, tuple: pd.DataFrame) -> Any:
        """Classifies a single tuple."""
        # Initialize class weights:
        class_weights = {current_class: 0 for current_class in self.classes}

        # Calculate weight of each classifier's vote
        classifier_weight = [
            log(1 / beta_i, 2) if beta_i > 0 else 0 for beta_i in self.betas
        ]

        # Obtain predictions from all classifiers
        predicitons_per_classifier = [
            current_classifier.predict(tuple)[0]
            for current_classifier in self.classifiers
        ]

        # Update class weights
        for current_weight, current_prediction in zip(
            classifier_weight, predicitons_per_classifier
        ):
            class_weights[current_prediction] += current_weight

        # Return class with the highest weight
        return max(class_weights, key=class_weights.get)
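
The weighted vote can be traced on a toy case. The sketch below uses made-up beta values and votes for three weak classifiers; only the vote weight `log(1 / beta, 2)` and the "highest summed weight wins" rule are taken from the class above.

```python
from math import log

# Hypothetical betas and votes of three weak classifiers for one tuple
betas = [0.125, 0.5, 0.5]
votes = ["unacc", "acc", "acc"]

# Vote weight log2(1 / beta): the lower the error, the larger the weight
vote_weights = [log(1 / beta_i, 2) for beta_i in betas]

# Sum the vote weights per class
class_weights = {"acc": 0.0, "unacc": 0.0}
for weight, vote in zip(vote_weights, votes):
    class_weights[vote] += weight

# The class with the highest summed weight wins
winner = max(class_weights, key=class_weights.get)

print(class_weights, winner)
```

In this toy case two of three classifiers vote for `acc`, yet `unacc` wins because the single classifier voting for it has a much smaller beta and therefore a larger vote weight.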

### 4.3.2. Instantiate and Train a AdaBoost Model

In [None]:
target_attribute = "condition"
adaboost = AdaBoost()
adaboost.fit(dataset=dataset, target_attribute=target_attribute)

### 4.3.3. Obtain Predictions with Your AdaBoost Model

To test this model, we import the corresponding test set:

In [None]:
test_dataset = pd.read_csv("car_test.csv")

# view dataset
test_dataset.head()

Get predictions:

In [None]:
predictions = adaboost.predict(test_dataset.iloc[:, :-1])

We will store the true values for convenience:

In [None]:
true_values = test_dataset.iloc[:, -1].tolist()

### 4.3.4. Evaluate Your AdaBoost Model

In our lecture you learned how to evaluate models. In the following cells we evaluate our previously trained AdaBoost model.

`scikit-learn` provides a range of evaluation metrics. Depending on the problem at hand (clustering, classification, or regression), different metrics are available. `scikit-learn` gives a good overview of them on this [page](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics).

In this exercise we are solely interested in (some) classification metrics, namely:
- Accuracy, function `accuracy_score`. [Link to documentation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html#sklearn.metrics.accuracy_score)
- Sensitivity/Recall, function `recall_score`. [Link to documentation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html#sklearn.metrics.recall_score)
- Precision, function `precision_score`. [Link to documentation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html#sklearn.metrics.precision_score)
- Specificity. `scikit-learn` does not provide a dedicated function for this metric. However, it is simple to compute.
- Confusion matrix, class `ConfusionMatrixDisplay` with function `from_predictions`. [Link to documentation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.ConfusionMatrixDisplay.html#sklearn.metrics.ConfusionMatrixDisplay.from_predictions)
- Receiver Operating Characteristic (ROC), function `roc_curve`. [Link to documentation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html#sklearn.metrics.roc_curve)
- Area Under the Curve (AUC), function `auc`. [Link to documentation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.auc.html#sklearn.metrics.auc)
- F1, function `f1_score`. [Link to documentation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html#sklearn.metrics.f1_score)
- Classification report, function `classification_report`. [Link to documentation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html#sklearn.metrics.classification_report)

#### 4.3.4.1. Import `scikit-learn`'s Metrics Module

In [None]:
from sklearn import metrics

#### 4.3.4.2. Calculate the Evaluation Metrics

<div class="alert alert-info" role="alert">
    
**Task 2: Calculate Accuracy**
    
Apply function `metrics.accuracy_score` to calculate the model's accuracy. [Link to documentation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html#sklearn.metrics.accuracy_score).
    
</div>    

In [None]:
# apply metrics.accuracy_score here

In [None]:
metrics.accuracy_score(y_true=true_values, y_pred=predictions)
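
To see what `accuracy_score` computes, the following sketch compares it with a manual calculation on made-up toy labels (not taken from the car dataset).

```python
from sklearn import metrics

# Hypothetical toy labels for illustration
toy_true = ["acc", "unacc", "unacc", "acc"]
toy_pred = ["acc", "unacc", "acc", "acc"]

# Accuracy by hand: share of positions where prediction and truth agree
manual_accuracy = sum(t == p for t, p in zip(toy_true, toy_pred)) / len(toy_true)

library_accuracy = metrics.accuracy_score(y_true=toy_true, y_pred=toy_pred)

print(manual_accuracy, library_accuracy)
```

Three of the four toy predictions are correct, so both values agree.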

<div class="alert alert-info" role="alert">
    
**Task 3: Calculate Sensitivity/Recall**

Apply function `metrics.recall_score` to calculate the model's sensitivity. [Link to documentation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html#sklearn.metrics.recall_score).

</div>    
    
**Hint:** Make sure to set parameter `pos_label='acc'`.
    

In [None]:
# apply metrics.recall_score here

In [None]:
metrics.recall_score(y_true=true_values, y_pred=predictions, pos_label="acc")
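
The effect of `pos_label` can be checked on a small example. The toy labels below are made up; the sketch only shows that recall depends on which class is treated as positive.

```python
from sklearn import metrics

# Hypothetical toy labels for illustration
toy_true = ["acc", "acc", "acc", "unacc"]
toy_pred = ["acc", "unacc", "acc", "unacc"]

# Recall with 'acc' as the positive class: 2 of the 3 true 'acc' tuples are found
recall_acc = metrics.recall_score(y_true=toy_true, y_pred=toy_pred, pos_label="acc")

# Recall with 'unacc' as the positive class: the single 'unacc' tuple is found
recall_unacc = metrics.recall_score(
    y_true=toy_true, y_pred=toy_pred, pos_label="unacc"
)

print(recall_acc, recall_unacc)
```

With string labels, the default `pos_label=1` does not match any label, which is why the parameter has to be set explicitly here.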

<div class="alert alert-info" role="alert">
        
**Task 4: Implement and Calculate Specificity**

Implement a function that calculates the specificity. Apply this function afterwards.    
    
</div>    

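Unfortunately, `scikit-learn` does not provide a function that directly computes the specificity. Yet, we know how to compute it: $\frac{\text{True Negatives}}{\text{Negatives}} = \frac{\text{True Negatives}}{\text{True Negatives} + \text{False Positives}}$.

Lucky for us, `scikit-learn` has the function `confusion_matrix` ([link to documentation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html#sklearn.metrics.confusion_matrix)), which returns all the counts we need. For a binary problem whose negative label sorts first (for example labels 0 and 1), flattening the matrix yields true negatives, false positives, false negatives, and true positives in exactly this order. By default the labels are sorted, so with string labels the order depends on their alphabetical order.

**Hint:** Make sure to assign the entries of the confusion matrix to the correct names.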

In [None]:
def specificity(y_true: List[Any], y_pred: List[Any]) -> float:
    raise NotImplementedError("Implement this function.")

In [None]:
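def specificity(y_true: List[Any], y_pred: List[Any]) -> float:
    # Labels are sorted alphabetically ('acc' before 'unacc'), so with 'acc' as the
    # positive class the flattened matrix reads tp, fn, fp, tn.
    tp, fn, fp, tn = metrics.confusion_matrix(y_true=y_true, y_pred=y_pred).ravel()
    return tn / (fp + tn)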

In [None]:
specificity(y_true=true_values, y_pred=predictions)
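
The order of the flattened confusion matrix is easy to get wrong with string labels. The sketch below uses made-up toy labels and treats `'acc'` as the positive class; it shows the default (alphabetically sorted) order and the order obtained when the negative class is listed first via the `labels` parameter.

```python
from sklearn import metrics

# Hypothetical toy labels: 3 true 'acc' tuples, 4 true 'unacc' tuples
toy_true = ["acc", "acc", "acc", "unacc", "unacc", "unacc", "unacc"]
toy_pred = ["acc", "unacc", "unacc", "unacc", "unacc", "unacc", "acc"]

# Default: labels are sorted, i.e. ['acc', 'unacc'].
# With 'acc' as the positive class the flattened order is tp, fn, fp, tn.
tp, fn, fp, tn = metrics.confusion_matrix(y_true=toy_true, y_pred=toy_pred).ravel()

# Listing the negative class first gives the order tn, fp, fn, tp.
tn2, fp2, fn2, tp2 = metrics.confusion_matrix(
    y_true=toy_true, y_pred=toy_pred, labels=["unacc", "acc"]
).ravel()

toy_specificity = tn / (tn + fp)

print(tp, fn, fp, tn)
print(tn2, fp2, fn2, tp2)
print(toy_specificity)
```

Both calls describe the same matrix; only the order of rows and columns changes, so the names on the left-hand side have to follow the label order.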

<div class="alert alert-info" role="alert">
    
**Task: Plot Confusion Matrix**

Plot the model's confusion matrix. `scikit-learn`'s class `ConfusionMatrixDisplay` provides plotting capabilities, for instance via its function `from_predictions`. [Link to documentation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.ConfusionMatrixDisplay.html#sklearn.metrics.ConfusionMatrixDisplay.from_predictions)
    
</div>    

In [None]:
# apply metrics.ConfusionMatrixDisplay.from_predictions here

In [None]:
metrics.ConfusionMatrixDisplay.from_predictions(y_true=true_values, y_pred=predictions)

<div class="alert alert-info" role="alert">
    
**Task 5: Calculate ROC Curve Measure**

Apply `metrics.roc_curve`. [Link to documentation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html#sklearn.metrics.roc_curve).

</div>

**Hint:** Convert the textual labels in `true_values` and `predictions` into their numerical equivalents: map `'acc'` to `1` and `'unacc'` to `0`.

In [None]:
# apply metrics.roc_curve here

In [None]:
fpr, tpr, thresholds = metrics.roc_curve(
    y_true=[1 if elem == "acc" else 0 for elem in true_values],
    y_score=[1 if elem == "acc" else 0 for elem in predictions],
    pos_label=1,
)

print("False Positive Rate", fpr)
print("True Positive Rate", tpr)
print("Thresholds", thresholds)

On their own, these values are hard to interpret. We will use them to compute the AUC and, later, to plot the ROC curve.

<div class="alert alert-info" role="alert">
    
**Task 6: Calculate Area Under the Curve (AUC)**

Apply function `metrics.auc` to calculate the model's AUC value. [Link to documentation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.auc.html#sklearn.metrics.auc).

</div>    

**Hint:** Take a look at the documentation. You need to call `roc_curve` first. More specifically, you need the False Positive Rate and True Positive Rate.

In [None]:
# apply metrics.auc here

In [None]:
metrics.auc(fpr, tpr)
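
Because the predictions here are hard 0/1 labels rather than scores, the ROC curve consists of a single operating point between (0, 0) and (1, 1). Under that assumption the area under the curve reduces to (TPR + 1 - FPR) / 2, which the following sketch checks on made-up toy labels.

```python
from sklearn import metrics

# Hypothetical toy labels: 4 positives and 4 negatives
toy_true = [1, 1, 1, 1, 0, 0, 0, 0]
toy_pred = [1, 1, 1, 0, 1, 0, 0, 0]  # hard 0/1 predictions, not scores

toy_fpr, toy_tpr, _ = metrics.roc_curve(y_true=toy_true, y_score=toy_pred, pos_label=1)
toy_auc = metrics.auc(toy_fpr, toy_tpr)

# Rates of the single operating point, computed by hand
tpr_point = 3 / 4  # 3 of 4 positives are predicted as positive
fpr_point = 1 / 4  # 1 of 4 negatives is predicted as positive

# Area under the two line segments (0,0) -> (FPR,TPR) -> (1,1)
closed_form = (tpr_point + 1 - fpr_point) / 2

print(toy_auc, closed_form)
```

This is also why the AUC of a classifier that only outputs labels equals the mean of its sensitivity and specificity.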

<div class="alert alert-info" role="alert">
    
**Task 7: Plot ROC Curve**

Use the class `metrics.RocCurveDisplay` to plot the model's ROC curve. [Link to documentation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.RocCurveDisplay.html).

</div>    

**Hint:** You need the values returned by `metrics.roc_curve` and `metrics.auc`. Make sure to use `matplotlib`'s functionality to show the plot.

In [None]:
import matplotlib.pyplot as plt

In [None]:
# apply metrics.RocCurveDisplay here

In [None]:
fpr, tpr, thresholds = metrics.roc_curve(
    y_true=[1 if elem == "acc" else 0 for elem in true_values],
    y_score=[1 if elem == "acc" else 0 for elem in predictions],
)

roc_auc = metrics.auc(fpr, tpr)
display = metrics.RocCurveDisplay(
    fpr=fpr, tpr=tpr, roc_auc=roc_auc, estimator_name="example estimator"
)
display.plot()
plt.show()

<div class="alert alert-info" role="alert">
    
**Task: Calculate F1**

Apply function `metrics.f1_score` to calculate the model's F1 value. [Link to documentation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html#sklearn.metrics.f1_score).

</div>    

**Hint:** We have a binary classification problem. Additionally, make sure to set the parameter `pos_label` accordingly.

In [None]:
# apply metrics.f1_score here

In [None]:
metrics.f1_score(
    y_true=true_values,
    y_pred=predictions,
    average="binary",  # it is the default, but included here to show that we have a binary problem.
    pos_label="acc",
)
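
F1 is the harmonic mean of precision and recall. The sketch below verifies this relation on made-up toy labels, with `'acc'` as the positive class.

```python
from sklearn import metrics

# Hypothetical toy labels for illustration
toy_true = ["acc", "acc", "acc", "unacc", "unacc"]
toy_pred = ["acc", "acc", "unacc", "acc", "unacc"]

toy_precision = metrics.precision_score(
    y_true=toy_true, y_pred=toy_pred, pos_label="acc"
)
toy_recall = metrics.recall_score(y_true=toy_true, y_pred=toy_pred, pos_label="acc")
toy_f1 = metrics.f1_score(y_true=toy_true, y_pred=toy_pred, pos_label="acc")

# Harmonic mean of precision and recall
manual_f1 = 2 * toy_precision * toy_recall / (toy_precision + toy_recall)

print(toy_f1, manual_f1)
```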

<div class="alert alert-info" role="alert">

**Task: Print a Classification Report**

Apply function `metrics.classification_report`, which computes several important metrics in a single function call. [Link to documentation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html#sklearn.metrics.classification_report).
    
</div>    

**Hint:** For nicely formatted output, pass the result to the function `print`.

In [None]:
# apply metrics.classification_report here

In [None]:
print(metrics.classification_report(y_true=true_values, y_pred=predictions))
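
If individual values from the report are needed in code rather than as text, `classification_report` can return a dictionary via `output_dict=True`. The sketch below uses made-up toy labels to show how such a dictionary can be accessed.

```python
from sklearn import metrics

# Hypothetical toy labels for illustration
toy_true = ["acc", "acc", "acc", "unacc", "unacc"]
toy_pred = ["acc", "acc", "unacc", "acc", "unacc"]

# Return the report as a nested dictionary instead of a formatted string
report = metrics.classification_report(
    y_true=toy_true, y_pred=toy_pred, output_dict=True
)

# Per-class metrics are stored under the class label
recall_for_acc = report["acc"]["recall"]

# The overall accuracy is stored under its own key
overall_accuracy = report["accuracy"]

print(recall_for_acc, overall_accuracy)
```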