In this notebook, we're going to talk about the benefits of creating a naïve baseline model to establish whether a machine learning model is taking advantage of the features to make more accurate predictions.

# Important: Run this code cell each time you start a new session!

In [None]:
!pip install numpy
!pip install pandas
!pip install matplotlib
# os is part of the Python standard library, so it does not need to be installed
!pip install scikit-learn
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os
import sklearn

In [None]:
from sklearn import datasets
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.metrics import accuracy_score, f1_score

# Load the dataset
diabetes_dataset = datasets.load_diabetes(as_frame=True)
df = diabetes_dataset.frame

# Rename the features for clarity
df = df.rename(columns={'s1': 'total serum cholesterol',
                        's2': 'low-density lipoproteins',
                        's3': 'high-density lipoproteins',
                        's4': 'total cholesterol / HDL',
                        's5': 'log of serum triglycerides',
                        's6': 'blood sugar'})

In [None]:
def generate_regressor(orig_df):
    """
    Train and test a regression model on the input DataFrame, returning the
    regressor, the feature names, and a dictionary of all the relevant data
    orig_df: the input DataFrame
    """
    # Set random number generator so the results are always the same
    np.random.seed(42)

    # Get the names of the features
    feature_names = orig_df.columns.tolist()
    feature_names.remove('target')

    # Split the data into train and test sets
    train_df, test_df = train_test_split(orig_df, test_size=0.2)

    # Separate features from labels
    x_train = train_df.drop('target', axis=1).values
    y_train = train_df['target'].values
    x_test = test_df.drop('target', axis=1).values
    y_test = test_df['target'].values

    # Create and train the model
    regr = LinearRegression()
    regr.fit(x_train, y_train)

    # Use the model to predict on the test set
    y_pred = regr.predict(x_test)

    # Create a nested dictionary of all the data for easier retrieval
    data = {'train': {'x': x_train, 'y': y_train},
            'test': {'x': x_test, 'y': y_test},
            'pred': {'x': x_test, 'y': y_pred}}
    return regr, feature_names, data

In [None]:
def generate_classifier(orig_df):
    """
    Train and test a classification model on the input DataFrame, returning the
    classifier, the feature names, and a dictionary of all the relevant data
    orig_df: the input DataFrame (its continuous target is binarized internally)
    """
    # Set random number generator so the results are always the same
    np.random.seed(42)

    # Copy the DataFrame since we will be modifying it
    df = orig_df.copy()

    # Get the names of the features
    feature_names = df.columns.tolist()
    feature_names.remove('target')

    # Turn the label into a binary variable
    df['target'] = df['target'] > 150

    # Split the data into train and test sets
    train_df, test_df = train_test_split(df, test_size=0.2)

    # Separate features from labels
    x_train = train_df.drop('target', axis=1).values
    y_train = train_df['target'].values
    x_test = test_df.drop('target', axis=1).values
    y_test = test_df['target'].values

    # Create and train the model
    clf = DecisionTreeClassifier()
    clf.fit(x_train, y_train)

    # Use the model to predict on the test set
    y_pred = clf.predict(x_test)

    # Create a nested dictionary of all the data for easier retrieval
    data = {'train': {'x': x_train, 'y': y_train},
            'test': {'x': x_test, 'y': y_test},
            'pred': {'x': x_test, 'y': y_pred}}
    return clf, feature_names, data

# How Do We Know Our Model Is Good?

In some of our previous lectures, we evaluated the performance of our models using a variety of metrics: accuracy and F1 score for classification, mean-squared error for regression, etc. We saw that these numbers improved as we made adjustments to our machine learning pipeline. However, we didn't really get into what could be considered a "good model" versus a "bad model".

If you're working in a specific problem domain, then you might have a target accuracy you are trying to achieve. For example, the American Heart Association recommends that any device or algorithm designed to estimate blood pressure should have an average accuracy of ±3 mmHg for systolic blood pressure and ±2 mmHg for diastolic blood pressure.

While it is nice to have a target that we are trying to reach, it is equally important to have a starting point to know where we are coming from before we even start building our model.

Suppose we built a machine learning model to help us gamble on the outcome of a coin flip or a die roll. Regardless of which coin or die it is given, the model always achieves an accuracy of 30% on its test data. Is that actually good? Consider these three scenarios:

| Scenario     | Accuracy of a human guesser     | Quality of machine learning model     |
|--------------|--------------|--------------|
| **Fair coin flip** | 1/2 options = 50% | 30% < 50%, so the model is worse than a human |
| **Rigged coin flip** | 80% (always guessing the favored side) | 30% << 80%, so the model is even worse in this situation |
| **Fair die roll** | 1/6 options ≈ 17% | 30% > 17%, so the model is actually outperforming humans |

Even though we always said that the accuracy of the model was 30% in these scenarios, the way we interpreted the quality of our model varied each time.
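To make the comparison concrete, here is a minimal simulation sketch of the three scenarios. The seed, the number of trials, the assumption that the rigged coin lands heads 80% of the time, and the names `claimed_acc` and `baseline_acc` are illustrative choices rather than part of the original example.

```python
import numpy as np

# Hypothetical simulation; the seed and trial count are arbitrary choices
rng = np.random.default_rng(0)
n_trials = 100_000

# Simulate outcomes: 1 = heads for the coins, faces 1 through 6 for the die
outcomes = {
    'fair coin': rng.integers(0, 2, size=n_trials),
    'rigged coin': (rng.random(n_trials) < 0.8).astype(int),  # assumed 80% heads
    'fair die': rng.integers(1, 7, size=n_trials),
}

# The accuracy our imaginary model achieves in every scenario
claimed_acc = 0.30

# A naive guesser always names the same outcome: heads (1) or a roll of 1
baseline_acc = {}
for name, results in outcomes.items():
    baseline_acc[name] = float(np.mean(results == 1))
    verdict = 'better' if claimed_acc > baseline_acc[name] else 'worse'
    print(f'{name}: baseline {baseline_acc[name]:0.2f}, model is {verdict}')
```

The same 30% accuracy ends up on different sides of the baseline depending on the scenario, which is the point of the table above.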

One of the best ways of contextualizing the performance of your model is by establishing a ***naïve baseline***. Some people refer to a naïve baseline as a ***dummy model*** because it follows a predefined rule or strategy rather than making any intelligent or informed predictions.

# Naïve Baselines for Regression

Before we generate regression baselines for our toy diabetes dataset, let's first remind ourselves of how well our regression model worked. We could look at any number of performance metrics, but we will focus on the mean-absolute error (MAE).

In [None]:
_, _, data = generate_regressor(df)
y_test = data['test']['y']
y_pred = data['pred']['y']
model_mae = mean_absolute_error(y_test, y_pred)
print(f'Mean absolute error: {model_mae:0.2f}')

`scikit-learn` provides a `DummyRegressor` object with a similar interface to the other machine learning model objects we have created so far. In other words, we can call `.fit()` and `.predict()` on it just like any other model. However, this regression model bases its predictions solely on the distribution of the labels it sees during training; it ignores the features entirely.

In [None]:
from sklearn.dummy import DummyRegressor

The most important parameter in this object is the `strategy` parameter, which determines how the `DummyRegressor` makes predictions. It can take one of the following values:
* **“mean”:** Always predicts the mean of the training set
* **“median”:** Always predicts the median of the training set
* **“quantile”:** Always predicts a specified quantile of the training set according to the `quantile` parameter (e.g., `quantile = 0.5` is the same as `"median"`)
* **“constant”:** Always predicts a constant value according to the `constant` parameter
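As a quick illustration of these strategies, the sketch below fits a `DummyRegressor` on a handful of made-up labels. The arrays `toy_x` and `toy_y`, the constant of 10, and the dictionary `toy_preds` are hypothetical and chosen only so the differences are easy to see.

```python
import numpy as np
from sklearn.dummy import DummyRegressor

# Made-up labels with one outlier; DummyRegressor ignores the feature values
toy_x = np.zeros((5, 1))
toy_y = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

toy_preds = {}
for strat in ['mean', 'median', 'quantile', 'constant']:
    dummy = DummyRegressor(strategy=strat, quantile=0.5, constant=10.0)
    dummy.fit(toy_x, toy_y)
    # Every sample receives the same prediction, so one sample is enough
    toy_preds[strat] = float(dummy.predict(toy_x[:1])[0])
    print(f'{strat}: {toy_preds[strat]:0.1f}')
```

Because of the outlier, the mean and the median disagree sharply here, which is one reason it can be worth trying more than one strategy.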

Let's see how well we can predict the disease progression score using all of these naïve methods:

In [None]:
quant = 0.5 #@param {type:"slider", min:0, max:1, step:0.1}
const = 175 #@param {type:"slider", min:0, max:350, step:25}

# Get the required parts of the dataset
x_train = data['train']['x']
y_train = data['train']['y']
x_test = data['test']['x']
y_test = data['test']['y']

# Print the model's performance
print(f"Model's mean absolute error: {model_mae:0.2f}")

# Try all strategies
for strat in ['mean', 'median', 'quantile', 'constant']:
    # Train the dummy regressor on the training data
    dumb_regr = DummyRegressor(strategy=strat, quantile=quant, constant=const)
    dumb_regr.fit(x_train, y_train)

    # Test the dummy regressor on the test data
    dumb_y_pred = dumb_regr.predict(x_test)

    # Evaluate MAE for the dummy regressor
    dummy_mae = mean_absolute_error(y_test, dumb_y_pred)
    print(f'Mean absolute error for {strat}: {dummy_mae:0.2f}')

Although the MAE of our model may have seemed high on its own, comparing it to these naïve baselines reveals that our model is actually learning quite a bit of useful information from the features.
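One way to put a number on this comparison is the relative reduction in MAE over a baseline. The sketch below is self-contained and uses its own variable names (prefixed with `demo_`) so that it does not overwrite anything defined earlier. It also uses `random_state=42` for the split, so its numbers may differ slightly from the cells above.

```python
from sklearn import datasets
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Reload the dataset so this cell runs on its own
demo_x, demo_y = datasets.load_diabetes(return_X_y=True)
demo_x_train, demo_x_test, demo_y_train, demo_y_test = train_test_split(
    demo_x, demo_y, test_size=0.2, random_state=42)

# MAE of the real model
demo_model = LinearRegression().fit(demo_x_train, demo_y_train)
demo_model_mae = mean_absolute_error(demo_y_test, demo_model.predict(demo_x_test))

# MAE of the "always predict the training mean" baseline
demo_dummy = DummyRegressor(strategy='mean').fit(demo_x_train, demo_y_train)
demo_dummy_mae = mean_absolute_error(demo_y_test, demo_dummy.predict(demo_x_test))

# Fraction of the baseline's error that the model removes
rel_improvement = 1 - demo_model_mae / demo_dummy_mae
print(f'Model MAE: {demo_model_mae:0.2f}')
print(f'Baseline MAE: {demo_dummy_mae:0.2f}')
print(f'Relative improvement: {rel_improvement:0.1%}')
```

A value near 0 would mean the model is barely better than the baseline, while a value near 1 would mean it removes almost all of the baseline's error.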

# Naïve Baselines for Classification

Now let's compare the accuracy of our classifier to some naïve baselines using the `DummyClassifier` object.

In [None]:
_, _, data = generate_classifier(df)
y_test = data['test']['y']
y_pred = data['pred']['y']
model_acc = accuracy_score(y_test, y_pred)
print(f'Accuracy: {model_acc:0.2f}')

In [None]:
from sklearn.dummy import DummyClassifier

Before we describe the strategies that `DummyClassifier` can take, we will first need to introduce one new piece of terminology. ***One-hot encoding*** is a popular technique used to represent categorical variables as binary vectors. Let's say we have three possible classes: negative (0), neutral (1), and positive (2). Using one-hot encoding, we could convert these labels to binary vectors as follows:

| Label | One-hot Encoded Vector |
|-------|------------------------|
| Negative | `[1, 0, 0]` |
| Neutral | `[0, 1, 0]` |
| Positive | `[0, 0, 1]` |
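
The table above can be reproduced with a few lines of NumPy. This is a minimal sketch on a made-up list of labels; the names `sentiment_labels` and `one_hot` are illustrative.

```python
import numpy as np

# Made-up labels: 0 = negative, 1 = neutral, 2 = positive
sentiment_labels = np.array([0, 1, 2, 1])
n_classes = 3

# Row i of the identity matrix is the one-hot vector for class i,
# so indexing the identity matrix by the labels encodes every sample at once
one_hot = np.eye(n_classes, dtype=int)[sentiment_labels]
print(one_hot)
```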


You might recall that classifiers have a method called `.predict_proba()` that returns, for each sample, the probability that it belongs to each class. This is in contrast to `.predict()`, which simply reports the class to which the sample most likely belongs.

Conceptually, you can think of the model as being trained on one-hot encoded labels, although the internal representation varies from model to model. Likewise, for many classifiers, calling `.predict()` is equivalent to calling `.predict_proba()` to get a vector of probabilities and then picking the class with the highest probability.
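
The relationship between the two methods can be checked directly. The sketch below uses a small decision tree on the Iris dataset purely as an example; the dataset, the `max_depth` setting, and the variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Train a small classifier on a three-class dataset
iris_x, iris_y = load_iris(return_X_y=True)
iris_clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris_x, iris_y)

# One row of class probabilities per sample
iris_proba = iris_clf.predict_proba(iris_x)

# Pick the most probable class for each sample by hand
labels_from_proba = iris_clf.classes_[np.argmax(iris_proba, axis=1)]

# Compare against what .predict() reports
proba_matches_predict = np.array_equal(labels_from_proba, iris_clf.predict(iris_x))
print(f'First row of probabilities: {iris_proba[0]}')
print(f'argmax of predict_proba matches predict: {proba_matches_predict}')
```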

With this detail in mind, let's talk about the prediction strategies supported by the `DummyClassifier` object:
* **“most_frequent”:** Always returns the most frequent class label in the training data. If you call `.predict_proba()`, the model will return the matching one-hot encoded vector.
* **“prior”:** The output of `.predict()` will be the same as if you were using `"most_frequent"`. However, the output of `.predict_proba()` will be continuous probabilities according to the distribution of the training set rather than one-hot encoded vectors.
* **“stratified”:** The output of `.predict_proba()` randomly samples one-hot vectors from the distribution of the training set. The `.predict()` method returns the corresponding class label for each vector.
* **“uniform”:** Generates predictions at random with equal probability from the list of unique classes observed in the training set
* **“constant”:** Always predicts a constant label according to the `constant` parameter
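
The difference between `"most_frequent"` and `"prior"` is easiest to see on a tiny example. The sketch below uses four made-up labels (three of class 0 and one of class 1); the names `toy_clf_x`, `toy_clf_y`, and `toy_clf_out` are hypothetical.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Made-up labels: 75% class 0, 25% class 1; the features are ignored
toy_clf_x = np.zeros((4, 1))
toy_clf_y = np.array([0, 0, 0, 1])

toy_clf_out = {}
for strat in ['most_frequent', 'prior']:
    dummy = DummyClassifier(strategy=strat).fit(toy_clf_x, toy_clf_y)
    toy_clf_out[strat] = {
        'label': int(dummy.predict(toy_clf_x[:1])[0]),
        'proba': dummy.predict_proba(toy_clf_x[:1])[0].tolist(),
    }
    print(f'{strat}: {toy_clf_out[strat]}')
```

Both strategies predict class 0, but `"most_frequent"` reports the one-hot vector `[1.0, 0.0]` while `"prior"` reports the class frequencies `[0.75, 0.25]`. This matters for metrics that are computed from probabilities rather than from hard labels.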

Let's see how well we can predict the disease progression severity level using all of these naïve methods:

In [None]:
const = 1 #@param {type:"slider", min:0, max:1, step:1}

# Get the required parts of the dataset
x_train = data['train']['x']
y_train = data['train']['y']
x_test = data['test']['x']
y_test = data['test']['y']

# Print the model's performance
print(f"Model's accuracy: {model_acc:0.2f}")

# Try all strategies
for strat in ['most_frequent', 'prior', 'stratified', 'uniform', 'constant']:
    # Train the dummy classifier on the training data
    dumb_clf = DummyClassifier(strategy=strat, constant=const)
    dumb_clf.fit(x_train, y_train)

    # Test the dummy classifier on the test data
    dumb_y_pred = dumb_clf.predict(x_test)

    # Evaluate accuracy for the dummy classifier
    dummy_acc = accuracy_score(y_test, dumb_y_pred)
    print(f'Accuracy for {strat}: {dummy_acc:0.2f}')

Again, we see that our model is outperforming our naïve baselines with respect to accuracy. As a sanity check, let's re-run the same code but look at F1 score rather than accuracy:

In [None]:
const = 0 #@param {type:"slider", min:0, max:1, step:1}

# Get the required parts of the dataset
x_train = data['train']['x']
y_train = data['train']['y']
x_test = data['test']['x']
y_test = data['test']['y']

# Print the model's performance
model_f1 = f1_score(y_test, y_pred)
print(f"Model's F1 score: {model_f1:0.2f}")

# Try all strategies
for strat in ['most_frequent', 'prior', 'stratified', 'uniform', 'constant']:
    # Train the dummy classifier on the training data
    dumb_clf = DummyClassifier(strategy=strat, constant=const)
    dumb_clf.fit(x_train, y_train)

    # Test the dummy classifier on the test data
    dumb_y_pred = dumb_clf.predict(x_test)

    # Evaluate F1 score for the dummy classifier
    dummy_f1 = f1_score(y_test, dumb_y_pred)
    print(f'F1 score for {strat}: {dummy_f1:0.2f}')

Notice that some of our naïve baselines get an F1 score of 0. This is because F1 score is the harmonic mean of precision and recall. When a naïve classifier always predicts negative, it never produces a true positive, so its recall is zero. Its precision is undefined because it makes no positive predictions at all (0 true positives out of 0 predicted positives); `scikit-learn` reports this case as 0 by default and raises a warning. Either way, the F1 score comes out to 0. This shows that it is important to think carefully about which metrics you use to evaluate both your own model and potential baselines.
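
The always-negative case can be reproduced on a few made-up labels. In this sketch, `zero_division=0` tells `scikit-learn` to report 0 without a warning when a score has nothing to divide by. The arrays and variable names are illustrative.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Made-up ground truth with an even class balance
toy_true = np.array([1, 0, 1, 1, 0, 0])

# A naive classifier that always predicts the negative class
toy_all_neg = np.zeros_like(toy_true)

# There are no positive predictions, so precision has nothing to divide by;
# zero_division=0 makes scikit-learn report 0 instead of warning
neg_precision = precision_score(toy_true, toy_all_neg, zero_division=0)
neg_recall = recall_score(toy_true, toy_all_neg, zero_division=0)
neg_f1 = f1_score(toy_true, toy_all_neg, zero_division=0)
neg_acc = accuracy_score(toy_true, toy_all_neg)

print(f'Accuracy: {neg_acc:0.2f}')
print(f'Precision: {neg_precision:0.2f}')
print(f'Recall: {neg_recall:0.2f}')
print(f'F1 score: {neg_f1:0.2f}')
```

The same predictions score 50% on accuracy and 0 on F1, so the choice of metric changes how strong this baseline looks.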

# Other Baselines

The baselines we've discussed so far are naïve because they rely solely on the distribution of the labels or predetermined rules to make decisions. That doesn't mean that these are the only kinds of baselines you can use as a frame of reference. Here are some examples of other kinds of "baselines" you might consider:
* **Model architecture:** If you want to show that a random forest classifier is the right model for your problem, your baseline can come in the form of a decision tree trained on the same set of data.
* **Features:** If you want to show that you need all of your features to generate the best model possible, your baselines can come in the form of models that are trained on subsets of your features. With our toy diabetes dataset, for example, the baselines could be a model trained on just the demographic variables and a model trained on just the blood serum test results.
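
A feature-subset baseline can be sketched as follows. The grouping of columns into "demographics" (`age`, `sex`) and "blood serum" (`s1` through `s6`) is an assumption made for illustration, and the cell reloads the dataset with its original column names so that it runs on its own.

```python
from sklearn import datasets
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Reload the dataset with its original column names
subset_frame = datasets.load_diabetes(as_frame=True).frame
subset_train, subset_test = train_test_split(
    subset_frame, test_size=0.2, random_state=42)

# Assumed feature groupings; adjust these to suit your own question
feature_sets = {
    'demographics only': ['age', 'sex'],
    'blood serum only': ['s1', 's2', 's3', 's4', 's5', 's6'],
    'all features': [c for c in subset_frame.columns if c != 'target'],
}

# Train one model per feature set on the same split and compare test MAE
subset_mae = {}
for name, cols in feature_sets.items():
    subset_regr = LinearRegression().fit(subset_train[cols], subset_train['target'])
    subset_pred = subset_regr.predict(subset_test[cols])
    subset_mae[name] = mean_absolute_error(subset_test['target'], subset_pred)
    print(f'MAE with {name}: {subset_mae[name]:0.2f}')
```

Keeping the train/test split identical across the feature sets means any difference in MAE comes from the features rather than from which samples ended up in the test set.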