# **KogSys-ML-B Introduction to Machine Learning**
## **Ensembles and Evaluation**
---

To set up a conda environment suitable for this notebook, you can use the following console commands:

```bash
conda create -y -n ens-eval python=3.13
conda activate ens-eval
python -m pip install -r requirements.txt
```

**Note**: Conda can become very hard-drive hungry when you use many environments. Consider regularly deleting environments you no longer need and running the ``conda clean --all`` command to remove no longer needed packages and cached files.

You can also install the requirements for this notebook into an existing environment by running the cell below:

In [17]:
!python -m pip install -q -r requirements.txt

In [18]:
from __future__ import annotations

from random import shuffle

import numpy as np
import pandas as pd
from numpy.typing import ArrayLike
from sklearn.base import ClassifierMixin
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_score,
    recall_score,
)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

### **Data Preprocessing**

Last time, we worked with a dataset which was already set up to be used with ``scikit-learn``. Today, we will work with a less favorable base and learn to work around it, "wrangling" our raw data into a shape we can work with.

The dataset we will be working with today is the Spotify tracks dataset, which is available on [Hugging Face](https://huggingface.co/datasets/maharshipandya/spotify-tracks-dataset). Although the original dataset is almost ready to use, we will work with a modified version to practice some basic data transformations that will be helpful in any future Machine Learning task.

With this notebook, you downloaded four files: ``spotify-1.csv``, ``spotify-2.csv``, ``spotify-3.parquet``, and ``spotify-test.csv``:
- ``spotify-1.csv`` and ``spotify-2.csv`` contain the same rows, identified by the column ``"track_id"``, but different columns.
- ``spotify-3.parquet`` contains additional, complete rows, but is saved in a different file format and some column types don't match.
- ``spotify-test.csv`` contains the complete test data. No modifications are needed, but you are not allowed to use this for any purpose during training, only to evaluate the **final** model.

#### **Task: Load ``spotify-1.csv`` and ``spotify-2.csv`` and join them _on_ the column ``"track_id"``**

The resulting DataFrame should have the shape ``(43075, 20)``.

In [20]:
df_1 = pd.read_csv("spotify-1.csv")
df_2 = pd.read_csv("spotify-2.csv")

df_12 = pd.merge(df_1, df_2, how="inner", on="track_id")
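
As a minimal sketch of what such a join does, the toy frames below stand in for the two CSV files (the values are made up for illustration). The ``validate="one_to_one"`` argument is an optional safety check that is not required by the task: it makes ``pandas`` raise an error if ``"track_id"`` is duplicated on either side.

```python
import pandas as pd

# Toy stand-ins for spotify-1.csv and spotify-2.csv (values are invented).
left = pd.DataFrame({"track_id": ["a", "b", "c"], "tempo": [120.0, 98.5, 140.2]})
right = pd.DataFrame({"track_id": ["c", "a", "b"], "energy": [0.9, 0.4, 0.7]})

# Inner join on the shared key; rows are matched by track_id, not by position.
merged = pd.merge(left, right, how="inner", on="track_id", validate="one_to_one")

print(merged.shape)  # (3, 3): three matched rows, key column plus one column per frame
```

Comparing the shape before and after the join is a quick way to notice rows that were dropped because their key had no partner.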

#### **Task: Load ``spotify-3.parquet`` and combine it with your result from the previous task.**

Note that some columns from the new frame will load with the wrong datatype. To save you time searching, the columns in question are ``"popularity"`` and ``"explicit"``. ``pandas`` will not raise an error when you combine frames with mismatched dtypes; instead, it silently upcasts the affected columns to a common dtype that can hold both. Make sure to convert these columns to the most _specific_ datatype that fits the data (``int`` and ``bool``, respectively). The resulting DataFrame should have the shape ``(71792, 20)``.

If you name your resulting frame ``df``, you can use the assertions in the cell below to check whether your solution worked.

**Note:** In this case, it doesn't matter whether you change the datatype of the columns _before_ or _after_ combining the two frames. However, it is better practice to do it beforehand and combine only DataFrames with matching types for all columns.

<div class="alert alert-block alert-danger">

**Note:** If you get an **ArrowKeyError**, remove **PyArrow** from your environment and **requirements.txt** and replace it with **fastparquet**.

</div>

In [25]:
df_3 = pd.read_parquet("spotify-3.parquet")
df_3["popularity"] = df_3["popularity"].astype(int)
df_3["explicit"] = df_3["explicit"].astype(bool)

df = pd.concat([df_12, df_3], axis=0)

In [26]:
assert df["popularity"].dtype == int
assert df["explicit"].dtype == bool
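
To see the silent upcasting in isolation, here is a small sketch with two toy frames. The specific mismatch (floats instead of integers, 0/1 integers instead of booleans) is an assumption made for illustration; the dtypes in ``spotify-3.parquet`` may differ.

```python
import pandas as pd

clean = pd.DataFrame({"popularity": [73, 55], "explicit": [False, True]})
# Hypothetical mismatch: popularity stored as float, explicit stored as 0/1 integers.
messy = pd.DataFrame({"popularity": [12.0, 80.0], "explicit": [1, 0]})

# No error is raised here; the columns simply end up with a wider dtype.
naive = pd.concat([clean, messy], axis=0, ignore_index=True)
print(naive.dtypes)

# Casting before concatenating keeps the intended dtypes.
messy_fixed = messy.astype({"popularity": "int64", "explicit": bool})
fixed = pd.concat([clean, messy_fixed], axis=0, ignore_index=True)
print(fixed.dtypes)
```

Passing ``ignore_index=True`` is optional; it gives the combined frame a fresh, unique index instead of repeating the index values of both inputs.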

#### **Task: Filter the columns to only columns which can sensibly contribute to decision-making without overfitting the data.**

In [7]:
exclude_cols: list[str] = ["track_id", "artists", "album_name", "track_name"]

df = df.drop(columns=exclude_cols)

#### **Task: Performing a dataset split – manually**

In most cases, you will do just fine with using ``scikit-learn``'s ``train_test_split`` (or later: ``PyTorch``'s ``random_split``). However, there are some edge cases where you have to handle splitting yourself, so this task teaches you the basics of how to go about this: _Index Lists_.

Essentially, the goal is to list all indices of your data, shuffle that list, and then divide it into size-based chunks. In the next cell, write your own function which takes an ``ArrayLike`` object and a list of fractions (i.e. ``float``s) as input and returns a list of ``ArrayLike`` objects with the same length as the fraction list.

In [None]:
def mysplit(arr: ArrayLike, ratios: list[float]) -> list[ArrayLike]:
    """
    An example implementation how basic index-list splitting can be handled.

    Parameters
    ----------
    arr: ArrayLike
        ArrayLike object to perform the split on
    ratios: list[float]
        List of float ratios defining the relative split sizes. Must sum to 1.

    Returns
    -------
    list[ArrayLike]
        A list of the same length as ratios with the corresponding array subsets.
    """
    # Ensure that the ratios sum to 1 (up to floating-point tolerance), raise a ValueError if they don't
    if not np.isclose(sum(ratios), 1.0):
        raise ValueError(f"ratio list {ratios} does not sum to 1.")

    # collect index list and shuffle it
    index_list = list(range(len(arr)))
    shuffle(index_list)

    lengths = [int(len(index_list) * r) for r in ratios]    # turn ratios into absolute split lengths based on the total number of samples
    lengths[-1] += len(index_list) - sum(lengths)           # ensure length sums to total, i.e. apply a correction to the last split (avoid off-by-one)

    ret = []
    start = 0                                               # first split starts at 0th element in index list
    for le in lengths:
        stop = start + le                                   # calculate stop point based on lengths
        ret.append(
            arr.iloc[index_list[start:stop],:]              # use iloc indexing to get the split array
        )
        start = stop                                        # update start point

    return ret

print(len(df))
tmp_1, tmp_2 = mysplit(df, [0.8, 0.2])
print(len(tmp_1), len(tmp_2), sum((len(tmp_1), len(tmp_2))))

71792
57433 14359 71792
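
The same index-list idea can also be sketched with ``numpy`` alone. The helper name ``split_indices`` and its ``seed`` parameter are hypothetical choices for this illustration. It returns the shuffled indices rather than the data, so the caller can apply them to any container (for example via ``df.iloc[idx]``).

```python
import numpy as np


def split_indices(n_samples: int, ratios: list[float], seed: int = 0) -> list[np.ndarray]:
    """Return one index array per ratio; together they partition range(n_samples)."""
    # Compare with a tolerance, since sums of floats can carry rounding error.
    if not np.isclose(sum(ratios), 1.0):
        raise ValueError(f"ratio list {ratios} does not sum to 1.")

    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)           # shuffled copy of 0..n_samples-1

    lengths = [int(n_samples * r) for r in ratios]  # truncate each split length
    lengths[-1] += n_samples - sum(lengths)         # last split absorbs the remainder
    cuts = np.cumsum(lengths)[:-1]                  # positions at which to cut
    return np.split(shuffled, cuts)


train_idx, val_idx = split_indices(10, [0.8, 0.2])
print(len(train_idx), len(val_idx))
```

Fixing the seed makes the split reproducible, which is helpful when comparing models across runs.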


### **Learning an Ensemble**

#### **Task / Baseline: Use ``scikit-learn``'s ``RandomForestClassifier`` to train a random forest of 50 entropy-based (ID3-style) trees.**

Now, load ``spotify-test.csv`` as well and use it with the classifier's ``score`` method. You may need to reorder (``reindex``) the columns in the DataFrame to match the ones of your training frame.

In [9]:
X_train, y_train = df.drop(columns=["track_genre"]), df["track_genre"]
rf = RandomForestClassifier(
    n_estimators=50,
    criterion="entropy",
)
rf.fit(X_train, y_train)

RandomForestClassifier(criterion='entropy', n_estimators=50)


In [10]:
df_test = pd.read_csv("spotify-test.csv")
df_test = df_test.reindex(df.columns, axis=1)
X_test, y_test = df_test.drop(columns=["track_genre"]), df_test["track_genre"]
rf.score(X_test, y_test)

0.4365702824669898
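
The column alignment step can be checked on a toy example. The frames below are invented for illustration; the point is that ``reindex`` reorders columns to match the training frame, drops columns the training frame does not have, and fills columns that are missing from the test frame with ``NaN``.

```python
import pandas as pd

train = pd.DataFrame({"tempo": [120.0], "energy": [0.8], "track_genre": ["rock"]})
test = pd.DataFrame(
    {"track_genre": ["jazz"], "energy": [0.3], "tempo": [90.0], "track_id": ["x1"]}
)

# Reorder the test columns to match the training frame; extra columns are dropped.
aligned = test.reindex(train.columns, axis=1)
print(list(aligned.columns))  # ['tempo', 'energy', 'track_genre']

# A column that exists only in the training frame would show up as all-NaN here.
assert not aligned.isna().any().any()
```

Checking for ``NaN`` after reindexing is a cheap way to catch misspelled or missing column names before they reach the classifier.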

#### **Task: Voting, DIY**

Your task is to create your own ensemble of trees, with a twist: Each tree should be trained on a random subset (say, $80\%$) of the training data, and validated on the rest. You do not need to implement a $k$-fold-like system; simply performing a new random split for each tree will suffice. Use the validation scores to create a weighted decision system which takes the validation performance of each individual tree into account. Do not implement the subspace sampling for the trees which is part of the original Random Forest algorithm.

You can use the following class skeleton to help get you started!

**Note:** Focus on the algorithm, not the performance. Trying to implement this decision algorithm while also trying to maximize performance is very difficult, and is not the goal for this course. It is okay if your implementation is both slower and less powerful than the ``scikit-learn`` base – you are just starting out, after all!

**Hint:** To get your bootstrap sample, you can use ``scikit-learn``'s ``resample`` function.

**Hint:** Add up all class predictions using the computed weights and return the class with the maximum score!

**Hint:** Common errors when working with ``numpy`` arrays arise from using no or incorrect ``dtype`` specifications when creating the arrays.
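
To make the weighted voting hint concrete, here is a small sketch for a single sample and three trees. The predictions and weights are made up; ``np.bincount`` with its ``weights`` argument is just one possible way to sum weights per class.

```python
import numpy as np

preds = np.array(["rock", "jazz", "jazz"])   # one prediction per tree (invented)
weights = np.array([0.9, 0.4, 0.3])          # validation score per tree (invented)

# Map each prediction to an integer class index, then sum the weights per class.
classes, inverse = np.unique(preds, return_inverse=True)
scores = np.bincount(inverse, weights=weights, minlength=len(classes))

winner = classes[np.argmax(scores)]
print(winner)  # rock
```

Note that a plain majority vote would pick ``"jazz"`` here (two votes against one), while the weighted vote picks ``"rock"`` because the single tree voting for it has a higher validation score than the other two combined.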

In [None]:
class DIYForest(ClassifierMixin):
    """
    A DIY Random Forest Class. By using ``ClassifierMixin``, some methods are included automatically, as long as ``fit`` and ``predict`` are implemented. Set the following class attributes in the constructor:

    Attributes
    ----------

    M: np.ndarray
        An array of models, in this case DecisionTreeClassifiers. You may also use a list if you aren't comfortable with numpy arrays. Make sure to adjust the type hint in that case.
    w: np.ndarray
        An array of model weights, which will be filled with validation scores during training. If you use arrays, initialize an array of zeros of the same shape as M in the constructor. If you use lists, you can have this grow organically.
    val_size: float
        The fraction of the training data to use for validation, i.e. calculating model weights
    """

    M: np.ndarray
    w: np.ndarray
    val_size: float

    def __init__(self, n_trees: int, val_size: float = 0.2, **tree_params) -> None:
        """
        In the constructor, set the class attributes. Initialize tree objects at this point.

        Parameters
        ----------
        n_trees: int
            How many trees to include in the forest
        val_size: float
            The fraction of the training data to use for validation
        **tree_params: dict
            Parameters to pass on to the tree constructor, i.e. ``DecisionTreeClassifier(**tree_params)``.
        """
        self.M = np.array([
            DecisionTreeClassifier(**tree_params) for _ in range(n_trees)
        ])
        self.w = np.zeros_like(self.M, dtype=np.float32)
        self.val_size = val_size

    def fit(self, X: ArrayLike, y: ArrayLike) -> DIYForest:
        """
        Fit each tree in the forest using decision attributes ``X`` and target attribute ``y``.

        Parameters
        ----------
        X: ArrayLike
            training examples (only decision attributes)
        y: ArrayLike
            labels

        Returns
        -------
        self
        """
        for idx, tree in enumerate(self.M):
            _X, _y = resample(X, y)                                 # create bootstrap sample (Dt)
            X_train, X_val, y_train, y_val = train_test_split(      # create a validation split for calculating the voting weight of the tree
                _X, _y, test_size=self.val_size
            )
            tree = tree.fit(X_train, y_train)                       # train the tree
            self.w[idx] = tree.score(X_val, y_val)                  # calculate weight and store in w attribute

        return self

    def predict(self, X: ArrayLike) -> np.ndarray:
        """
        The ensemble makes predictions for the labels of samples ``X``.

        Parameters
        ----------
        X: ArrayLike
            data samples

        Returns
        -------
        np.ndarray
            Array of shape [X.shape[0],] containing the predictions
        """
        preds: np.ndarray = np.zeros(shape=[X.shape[0], self.M.shape[0]], dtype=object)
                                                                    # empty array to put the predictions into, with samples in rows and individual trees in columns
        for idx, tree in enumerate(self.M):
            preds[:, idx] = tree.predict(X)                         # create prediction COLUMN for each tree by batch-evaluating all samples

        ret: np.ndarray = np.zeros(                                 # prepare one-dimensional output array of the same length as samples in X
            shape=[
                X.shape[0],
            ],
            dtype=object,
        )

        for idx, pred in enumerate(preds):
            _df = pd.DataFrame({"class": pred, "weight": self.w})   # For each sample, associate each tree's prediction with the corresponding tree weight
            ret[idx] = _df.groupby("class").sum().idxmax().item()   # Sum up weights per class and return the class with the highest value

        return ret


In [12]:
forest = DIYForest(50)
forest = forest.fit(X_train, y_train)
forest.score(X_test, y_test)

0.42916039890801716
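
One caveat about the ``fit`` method above: ``resample`` draws rows with replacement _before_ ``train_test_split`` is called, so copies of the same row can end up in both the training and the validation part, which tends to make the validation scores optimistic. A possible variant, sketched below on synthetic data, validates each tree on the rows that were never drawn (the "out-of-bag" rows). This departs from the $80\%$ split described in the task and is only meant as an illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; the real notebook would use X_train / y_train instead.
X, y = make_classification(n_samples=300, n_features=8, n_informative=5, random_state=0)
rng = np.random.default_rng(0)

n = len(X)
boot_idx = rng.integers(0, n, size=n)    # draw n row indices with replacement
oob_mask = np.ones(n, dtype=bool)
oob_mask[boot_idx] = False               # rows that were never drawn are out-of-bag

tree = DecisionTreeClassifier(random_state=0).fit(X[boot_idx], y[boot_idx])
weight = tree.score(X[oob_mask], y[oob_mask])   # accuracy on rows the tree has not seen
print(f"out-of-bag rows: {oob_mask.sum()}, weight: {weight:.3f}")
```

With a bootstrap sample of the same size as the data, roughly a third of the rows are typically left out of bag, so no separate validation split is needed in this variant.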

### **Evaluation**

Finally, let's calculate some evaluation metrics for both the ``scikit-learn`` model and our own!

#### **Accuracy**

Accuracy, implemented in ``sklearn.metrics.accuracy_score``, is defined as $$\operatorname{Accuracy}=\frac{\text{TP}+\text{TN}}{\text{TP}+\text{TN}+\text{FP}+\text{FN}}$$

In [13]:
print(
    f"SKLearn: {accuracy_score(y_test, rf.predict(X_test))}",
    f"\nOurs: {accuracy_score(y_test, forest.predict(X_test))}"
)

SKLearn: 0.4365702824669898 
Ours: 0.42916039890801716
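
For a multi-class problem like ours, the same idea reads "correct predictions divided by all predictions". As a small sketch with invented labels, this can be computed by hand from the confusion matrix, where the diagonal holds the correctly classified samples:

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

y_true = ["rock", "rock", "jazz", "jazz", "pop", "pop"]
y_pred = ["rock", "jazz", "jazz", "jazz", "pop", "rock"]

cm = confusion_matrix(y_true, y_pred)    # rows: true class, columns: predicted class
manual = np.trace(cm) / cm.sum()         # diagonal sum / total number of samples

print(manual, accuracy_score(y_true, y_pred))  # both are 4/6
```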


#### **Precision**

Precision, implemented in ``sklearn.metrics.precision_score``, is defined as $$\operatorname{Precision}=\frac{\text{TP}}{\text{TP}+\text{FP}}$$

In [14]:
print(
    f"SKLearn: {precision_score(y_test, rf.predict(X_test), average="weighted")}",
    f"\nOurs: {precision_score(y_test, forest.predict(X_test), average="weighted")}"
)

SKLearn: 0.43771433438877816 
Ours: 0.4292485818570276
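
The formula above is stated for a single class. With ``average="weighted"``, ``scikit-learn`` computes the precision per class and averages the results, weighting each class by its number of true samples (its support). A small sketch with invented labels reproduces this by hand:

```python
import numpy as np
from sklearn.metrics import precision_score

y_true = ["rock", "rock", "jazz", "jazz", "pop", "pop"]
y_pred = ["rock", "jazz", "jazz", "jazz", "pop", "rock"]
labels = ["jazz", "pop", "rock"]

# One precision value per class, in the order given by `labels`.
per_class = precision_score(y_true, y_pred, labels=labels, average=None)

# Support: how often each class occurs in the true labels.
support = np.array([y_true.count(c) for c in labels])
weighted = np.sum(per_class * support) / support.sum()

print(per_class, weighted)
print(precision_score(y_true, y_pred, average="weighted"))
```

Here ``"jazz"`` is predicted three times but correct only twice, so its precision is $2/3$, while ``"pop"`` is predicted once and correct once, giving a precision of $1$.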


#### **Recall**

Recall, implemented in ``sklearn.metrics.recall_score``, is defined as $$\operatorname{Recall}=\frac{\text{TP}}{\text{TP}+\text{FN}}$$

In [15]:
print(
    f"SKLearn: {recall_score(y_test, rf.predict(X_test), average="weighted")}",
    f"\nOurs: {recall_score(y_test, forest.predict(X_test), average="weighted")}"
)

SKLearn: 0.4365702824669898 
Ours: 0.42916039890801716
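
You may have noticed that these values are identical to the accuracy values above. That is no coincidence: weighting the per-class recall $\text{TP}_c / n_c$ by the class share $n_c / N$ cancels the class size $n_c$, leaving $\sum_c \text{TP}_c / N$, which is the accuracy. The following sketch with invented labels checks this numerically:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

y_true = ["rock", "rock", "jazz", "jazz", "pop", "pop"]
y_pred = ["rock", "jazz", "jazz", "jazz", "pop", "rock"]
labels = ["jazz", "pop", "rock"]

per_class = recall_score(y_true, y_pred, labels=labels, average=None)
support = np.array([y_true.count(c) for c in labels])

# Support-weighted mean of the per-class recalls.
weighted = np.sum(per_class * support) / support.sum()

print(weighted, accuracy_score(y_true, y_pred))
```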


#### **F1-Score**

F1-Score, implemented in ``sklearn.metrics.f1_score``, is defined as $$\operatorname{F1}=\frac{2\times\text{TP}}{2\times\text{TP}+\text{FP}+\text{FN}}$$

In [16]:
print(
    f"SKLearn: {f1_score(y_test, rf.predict(X_test), average="weighted")}",
    f"\nOurs: {f1_score(y_test, forest.predict(X_test), average="weighted")}"
)

SKLearn: 0.42566260115524845 
Ours: 0.4220250722293657
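
The F1 formula above is equivalent to the harmonic mean of precision and recall for a single class. As a sketch with invented labels, the per-class F1 values can be rebuilt from the per-class precision and recall:

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = ["rock", "rock", "jazz", "jazz", "pop", "pop"]
y_pred = ["rock", "jazz", "jazz", "jazz", "pop", "rock"]
labels = ["jazz", "pop", "rock"]

p = precision_score(y_true, y_pred, labels=labels, average=None)
r = recall_score(y_true, y_pred, labels=labels, average=None)

# Harmonic mean of precision and recall, computed per class.
f1_manual = 2 * p * r / (p + r)

print(f1_manual)
print(f1_score(y_true, y_pred, labels=labels, average=None))
```

Note that this identity holds per class. The weighted F1 reported above is the support-weighted mean of the per-class F1 values, which in general differs from the harmonic mean of the weighted precision and the weighted recall.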
