<br><br>
<span style="font-size:2em;font-weight:lighter;">194.025 Introduction to Machine Learning</span><br>
<span style="font-size:3em;font-weight:normal;line-height:70%;">Assignment 5: How to grow and care for a Tree</span>

---



Welcome to the 5th assignment of our course **Introduction to Machine Learning**. You will be able to earn up to a total of 10 points. Please read all descriptions carefully to get a full picture of what you have to do.

**Remark:** Some code cells are set to read-only. Please execute them regardless, as they contain important code. You can run a Jupyter cell by pressing `SHIFT + ENTER` or by pressing the play button at the top (in the same row as the save button). Cells where you have to implement code contain the comment `# YOUR CODE HERE` followed by `raise NotImplementedError`. Simply remove the `raise NotImplementedError` and insert your code.

Some other code cells start with the comment `# hidden tests ...`. Please do not change them in any way as they are used to grade the tasks after your submission.

In this assignment, you will implement important parts of a machine learning pipeline. We begin by implementing a weak baseline classifier. Then we move on to implementing evaluation metrics that allow us to determine how well our classifier performs. Finally, we use cross-validation to investigate how our model performs on unseen data.

In [98]:
import random
import numpy as np

def set_seeds():
    random.seed(42)
    np.random.seed(42)

set_seeds()

#### Dummy Classifier (2 Points)

We will start by implementing a **dummy classifier**. This classifier completely ignores the features of the given datapoints and always predicts the most common class in the training set.
When `fit` is called on a training set `X, y`, the classifier determines the most common class _C_ in the training set. When `predict` is called, it classifies every example as class _C_. The classifier should be written so that it is compatible with the _sklearn_ ML framework, which also shows how simple it is to add custom models to sklearn. To this end, the class `DummyClassifier` inherits from the `sklearn` base classes `BaseEstimator` and `ClassifierMixin`, and we only need to implement `fit` and `predict`.
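
The majority-class rule described above can be sketched in a few lines. This is a minimal illustration, not the reference solution: the helper name `most_common_label` is hypothetical, and it uses `np.unique` so that it also works for labels that are not non-negative integers.

```python
import numpy as np


def most_common_label(y):
    """Return the most frequent label in y (hypothetical helper).

    np.unique returns the labels in sorted order, and np.argmax returns the
    first maximum, so ties are resolved in favour of the smallest label.
    """
    labels, counts = np.unique(y, return_counts=True)
    return labels[np.argmax(counts)]


# Toy training labels: class 2 occurs three times, more than any other class
toy_labels = np.array([2, 0, 2, 1, 2, 0])
majority = most_common_label(toy_labels)

# "Predicting" for four unseen datapoints just repeats the majority class
toy_predictions = np.full(4, majority)
```

Because the features never enter the computation, any classifier that actually learns from `X` should beat this baseline on a reasonable dataset.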

In [99]:
from sklearn.base import BaseEstimator, ClassifierMixin


class DummyClassifier(BaseEstimator, ClassifierMixin):
    def __init__(self):
        pass

    def fit(self, X, y):
        """
        Fits the dummy classifier on the training set (X,y) i.e., extracts and stores the most common class

        Parameters
        ----------
        X : array-like (numpy.ndarray with floats or ints) of shape (n_samples, n_features)
           X holds n_samples datapoints of dimension n_features, and y holds the corresponding class of each datapoint (e.g. y[1] is the class of datapoint X[1,:])
        y : array-like (numpy.ndarray with ints) of shape (n_samples,)
        """
        # YOUR CODE HERE

        # np.bincount counts how often each non-negative integer label occurs;
        # argmax then picks the most frequent one (ties go to the smallest label)
        self.most_common_class_ = np.bincount(y).argmax()

        # Return self, following the sklearn estimator convention
        return self

    def predict(self, X):
        """
        Performs predictions based on the datapoints (X)
        The classifier should output the most common class from the dataset

        Parameters
        ----------
        X : array-like (numpy.ndarray with floats or ints) of shape (n_samples2, n_features2)

        Returns
        -------
        y : array-like (numpy.ndarray with ints) of shape (n_samples2,)
        """
        # YOUR CODE HERE

        # Predict the most common class for every row of X
        return np.full(X.shape[0], self.most_common_class_)

In [100]:
# hidden tests - DO NOT CHANGE THIS CELL

#### Evaluation Metrics (3 Points)

To evaluate how well our model performs, we need a metric that compares the predictions `y_pred` against the ground truth `y_true`. For this, you will implement two classification metrics from the lecture: accuracy and F1 score.
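
For reference, these are the standard definitions in terms of the confusion-matrix counts, assuming class 1 is the positive class (the notation in the lecture may differ). $TP$, $TN$, $FP$ and $FN$ denote the numbers of true positives, true negatives, false positives and false negatives:

$$
\text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad
\text{precision} = \frac{TP}{TP + FP}, \qquad
\text{recall} = \frac{TP}{TP + FN}
$$

$$
F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} = \frac{2\,TP}{2\,TP + FP + FN}
$$

For example, with $TP = 1$, $FP = 0$ and $FN = 4$ we get $\text{precision} = 1$, $\text{recall} = 0.2$ and $F_1 = 2 / 6 \approx 0.33$. If a denominator is zero, the corresponding quantity is undefined; returning 0 in that case is a common convention.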

In [101]:
def accuracy(y_true, y_pred):
    """
    Takes ground truth classes and predicted classes and returns the accuracy (see lecture).
    You can assume that every class is either 0 or 1.

    Expected output: float with accuracy.

    Parameters
    ----------
    y_true : numpy.ndarray of integers with shape (n_samples,) all integers are either 0 or 1
        Ground truth classes (i.e. the correct classes)
    y_pred : numpy.ndarray of integers with shape (n_samples,) all integers are either 0 or 1
        Predicted classes 
    """
    # YOUR CODE HERE

    n = len(y_true)
    true_positive = np.sum((y_true == 1) & (y_pred == 1))
    true_negative = np.sum((y_true == 0) & (y_pred == 0))
    
    result: float = (true_positive + true_negative)/n
    return result 


def f1(y_true, y_pred):
    """
    Takes ground truth classes and predicted classes and returns the F1 score (see lecture).
    You can assume that every class is either 0 or 1. Treat 1 as positive and 0 as negative.

    Expected output: float with f1 score.

    Parameters
    ----------
    y_true : numpy.ndarray of integers with shape (n_samples,) all integers are either 0 or 1
        Ground truth classes (i.e. the correct classes)
    y_pred : numpy.ndarray of integers with shape (n_samples,) all integers are either 0 or 1
        Predicted classes 
    """
    # YOUR CODE HERE

    true_positive = np.sum((y_true == 1) & (y_pred == 1))
    false_positive = np.sum((y_true == 0) & (y_pred == 1))
    false_negative = np.sum((y_true == 1) & (y_pred == 0))

    # Avoid division by zero
    if true_positive + false_positive == 0 or true_positive + false_negative == 0:
        return 0.0  

    precision = true_positive / (true_positive + false_positive)
    recall = true_positive / (true_positive + false_negative)

    if precision + recall == 0:
        return 0.0

    result: float = 2 * precision * recall / (precision + recall)
    return result    

In [102]:
# hidden tests - DO NOT CHANGE THIS CELL

In [103]:
# hidden tests - DO NOT CHANGE THIS CELL

#### Accuracy vs F1 Score (2 Points)

You have heard in the lecture that accuracy can be misleading. This exercise is intended to strengthen your intuition for this problem by having you construct an example where this is the case. More precisely, you must define predictions `y_pred` and ground truths `y_true` such that the accuracy is at least 0.95 and the F1 score is at most 0.6. Both should be numpy arrays whose entries are 0 or 1.
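
The effect is easiest to see on a heavily imbalanced dataset. The following dependency-free sketch is one possible illustration, not the required answer: the helper `binary_scores` and the names `example_true` and `example_pred` are hypothetical. It builds 95 negatives and 5 positives and lets the classifier find only one of the positives.

```python
def binary_scores(y_true, y_pred):
    """Return (accuracy, f1) for binary labels, treating 1 as positive.

    Hypothetical helper for illustration; undefined ratios are reported as 0.0.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, f1


# 95 negatives followed by 5 positives
example_true = [0] * 95 + [1] * 5
# All negatives are correct, but only the first positive is detected
example_pred = [0] * 95 + [1] + [0] * 4

example_accuracy, example_f1 = binary_scores(example_true, example_pred)
```

Here 96 of 100 predictions are correct, so the accuracy is 0.96, while precision is 1 and recall is only 0.2, which gives an F1 score of about 0.33.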

In [104]:
# Enter arrays below
y_true = np.array([0,1], dtype=int)
y_pred = np.array([0,1], dtype=int)

# YOUR CODE HERE

# With 95% negatives, always predicting 0 is correct 95% of the time (accuracy 0.95).
# However, no positive is ever detected (TP = 0), so the F1 score, the harmonic mean
# of precision and recall, drops to 0 and exposes the failure on the positive class.
y_true = np.array([0]*95 + [1]*5, dtype=int)
y_pred = np.array([0]*100, dtype=int)

print(f"Accuracy: {accuracy(y_true, y_pred)}")
print(f"F1 Score: {f1(y_true, y_pred)}")

Accuracy: 0.95
F1 Score: 0.0


In [105]:
# hidden tests - DO NOT CHANGE THIS CELL

#### Cross Validating (3 Points)

In the lecture, you have seen how to split your dataset using the holdout method. Here, we will use 4-fold _cross-validation_ to evaluate our dummy classifier on different splits of the data.

In [106]:
from sklearn.model_selection import KFold

set_seeds()

X = np.random.random((10000, 10))
y = np.random.randint(low=0, high=5, size=(10000,))

##### 1. Split your data
_sklearn_ provides [KFold](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html) to generate cross-validation splits of a dataset. Use it to split the dataset `(X, y)` as follows:
- Split it into 4 folds (i.e. 4-fold cross-validation).
- Do **not** set `shuffle` to `True` (this allows us to reproduce your solution).


##### 2. Evaluate your model on each split 
For every $i \in \{1,2,3,4\}$, select fold $i$ as the test set and the remaining folds as the training set. Train your dummy classifier on the training set and evaluate it on the test set. Collect the prediction accuracy on each split in a list called `accuracy_per_split`.
In the end, `accuracy_per_split` should contain 4 floating-point numbers, one per fold. Note that since `y` is drawn uniformly at random from 5 classes, the average accuracy should be roughly 0.2.
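
To make the procedure concrete, here is a dependency-free sketch of unshuffled k-fold evaluation with a majority-class predictor. It only mimics the contiguous index blocks that `KFold` yields when `shuffle` is left at its default; it is not sklearn's implementation, and `contiguous_folds` and all variable names are hypothetical.

```python
import random
from collections import Counter


def contiguous_folds(n_samples, n_splits):
    """Split indices 0..n_samples-1 into n_splits contiguous blocks.

    If n_samples is not divisible by n_splits, the first
    n_samples % n_splits blocks receive one extra index.
    """
    base, extra = divmod(n_samples, n_splits)
    folds, start = [], 0
    for i in range(n_splits):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


# Toy labels drawn uniformly from 5 classes (assumed data, for illustration only)
rng = random.Random(42)
toy_y = [rng.randrange(5) for _ in range(1000)]

fold_accuracies = []
for test_idx in contiguous_folds(len(toy_y), 4):
    held_out = set(test_idx)
    # Every index outside the current fold belongs to the training set
    train_labels = [toy_y[i] for i in range(len(toy_y)) if i not in held_out]
    # "Training" means remembering the most common training label
    majority = Counter(train_labels).most_common(1)[0][0]
    # Accuracy is the fraction of test labels equal to that majority label
    hits = sum(1 for i in test_idx if toy_y[i] == majority)
    fold_accuracies.append(hits / len(test_idx))

mean_accuracy = sum(fold_accuracies) / len(fold_accuracies)
```

Each datapoint lands in the test set exactly once, so the four fold accuracies together cover the whole dataset.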

In [110]:
accuracy_per_split = []

# YOUR CODE HERE

kf = KFold(n_splits=4)

for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    
    dc = DummyClassifier()
    dc.fit(X_train, y_train)
    y_pred = dc.predict(X_test)

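    # Fraction of correct predictions. The accuracy() function above assumes
    # binary labels, but y has 5 classes here, so we compare directly.
    acc = np.mean(y_pred == y_test)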
    accuracy_per_split.append(acc)

print(f"Average accuracy is {sum(accuracy_per_split)/len(accuracy_per_split):.2f}")

Average accuracy is 0.21


In [111]:
# hidden tests - DO NOT CHANGE THIS CELL