# Scoring Rules and Objective Functions -- How Should We Compare Models?

What does it mean for a model to be "Good?" It's a question philosophers have long pondered, if by "Philosophers" you mean "Every field of academia besides philosophy." We'll compare models with something called a *scoring rule*. A scoring rule starts with the predictions your model generates. Then, it gives the model points if the prediction is good, or takes some away if it does badly. At the end, we can compare the scores for each model.

## Assuming an Integer Number of Cats

Let's work through this with an example involving cats, because on the internet, everything *must* involve cats. We have 1000 cats. Every cat is either nice or mean. Mean cats are more likely to scratch their owners' arms than nice cats, but it's not a perfect correlation. Sometimes nice cats will scratch if you scare them, and sometimes mean cats won't scratch if you give them a treat. Our goal is to classify these cats -- cat-egorize them, if you will.

<center>
<figure>
    <img src=https://purrfectlove.net/wp-content/webp-express/webp-images/doc-root/wp-content/uploads/2020/04/cat-coronavirus-1280x720.png.webp width="700">
    <figcaption><i>This responsible cat is wearing a mask, so we can definitely classify him as nice.</i></figcaption>
</figure>
</center>

Let's do some simulations! Let's say that 90% of cats are nice, but 10% aren't. Mean cats are about 10% more likely to scratch their owners.

In [1]:
using DataFrames
using ParetoSmooth
using LinearAlgebra
using Random
using StatsFuns
using StatsPlots
using Turing

Random.seed!(1776)  # Setting a seed for reproducibility
Turing.setprogress!(false)  # Turn off progress monitor.

n_cats = 1000  # Let's simulate 1000 cats!
percent_mean = 0.1  # In this simulation, 10% of cats are mean.
percent_scratched = 0.2  # 20% of cat owners have scratches.
# Having a mean cat increases the log-odds that the owner is scratched by 0.1.
relative_risk = 0.1  # this is about a 10% increase.

function generate_cats(
    n_cats,
    percent_mean, 
    percent_scratched,
    relative_risk
)

    is_mean = Bool.(rand(Bernoulli(percent_mean), n_cats))

    logit_scratched = @. logit(percent_scratched) + relative_risk * (is_mean - percent_mean)
    scratched_owner = @. rand(BernoulliLogit(logit_scratched)) |> Bool

    col_names = [:is_mean, :scratched_owner]
    columns = hcat(is_mean, scratched_owner)  # hehe, h-cat
    return DataFrame(columns, col_names) 
end


data = generate_cats(
    n_cats, 
    percent_mean, 
    percent_scratched, 
    relative_risk
)

┌ Info: [Turing]: progress logging is disabled globally
└ @ Turing /home/lime/.julia/packages/Turing/y0DW3/src/Turing.jl:22
┌ Info: [AdvancedVI]: global PROGRESS is set as false
└ @ AdvancedVI /home/lime/.julia/packages/AdvancedVI/yCVq7/src/AdvancedVI.jl:15


Unnamed: 0_level_0,is_mean,scratched_owner
Unnamed: 0_level_1,Bool,Bool
1,0,0
2,0,1
3,0,0
4,0,0
5,0,0
6,0,1
7,0,0
8,0,0
9,0,0
10,0,0


Now, let's bring in some experts and ask them what they think is the probability that a cat is nice. Each expert uses a different mental model to guess this probability.

The first model says all cats are nice, because they are all fluffy, so it assigns probability 1 to every cat being nice. The second is designed by an experienced cat expert -- a veteran-arian. He can *exactly* calculate the probability that a cat is nice, using the data-generating process we laid out above, plus Bayes' theorem.

In [2]:
# P(mean|scratch) = P(scratch|mean) * P(mean) / P(scratch)
# This is Bayes' theorem! But don't worry, this is just a toy example to show you the ropes.
# You won't have to calculate this stuff yourself. Turing will do that for you!
scratch_given_mean = logistic(
    logit(percent_scratched) - relative_risk * percent_mean
)
scratch_given_nice = logistic(
    logit(percent_scratched) + relative_risk * (1 - percent_mean)
)
function apply_bayes(scratched::Bool)
    if scratched
        return scratch_given_mean * percent_mean / percent_scratched
    else
        return scratch_given_nice * percent_mean / (1 - percent_scratched)
    end
end

model_1 = repeat([0], n_cats)
model_2 = apply_bayes.(data[:, :scratched_owner]);

Now, let's take a look at three common scoring rules from machine learning. Let's see how much of an advanatage they show for model 2! The three scoring rules are:
1. The zero-one scoring rule: A model gets 1 point if it correctly classifies a cat, and 0 if it classifies it incorrectly. 
2. The total probability assigned to the correct outcome: For example, a model gets 1 point if the actual outcome was assigned probability 1, and .5 points if the model assigned a probability of 0.5.
3. The log scoring rule: Here, a model gets `log(p)` points, where `p` is the probability that our model assigns to an observation. For example, the model gets -1 points if it assigns a probability of `exp(-1)` to the event that actually happened. The log score is sometimes also called the Bayes score.

Let's see how each of these stacks up, starting with zero-one loss.

In [3]:
function mle(prediction)
    return prediction > 0.5
end

function zero_one(predictions, data)
    maximum_lik = mle.(predictions)
    correct = (maximum_lik .== data[:, :is_mean])
    return sum(correct)  # This will give % correctly classified.
end

hcat(zero_one(model_1, data), zero_one(model_2, data))

1×2 Matrix{Int64}:
 895  895

The two models perform exactly the same! That doesn't make any sense. What's going on?

In [4]:
hcat(all(mle.(model_1) .== 0), all(mle.(model_2) .== 0))

1×2 Matrix{Bool}:
 1  1

So the zero-one rule is broken because both models (correctly!) predict that all cats are more likely to be nice than not. This means that even though the first model is super overconfident -- it says there's a 100% chance that every cat is nice! -- the model doesn't get punished. This is also why you should *never* judge a model based on how many observations it classifies correctly. If you ever see this score in a paper, ignore it. The percentage of observations classified correctly tells you pretty much nothing about the quality of the classifier.

This scoring rule is rarely used explicitly, but it's very common for it to be used to fit a model. In fact, it's probably the most common scoring rule in statistics, because any model using the maximum likelihood estimator is maximizing the zero-one score! This is why we don't recommend using maximum likelihood estimators. Sometimes they work well, but in some cases (like this one) they can do very badly.

Let's try the second scoring rule. This rule gives a model some number of points that equals the average probability assigned by our model to the observed data.

In [5]:
function assigned_probability(prediction, cat_mean)
    if cat_mean
        return prediction
    else
        return 1 - prediction
    end
end

function mean_probability(predictions, data)
    return sum(assigned_probability.(predictions, data[:, :is_mean]))
end

hcat(mean_probability(model_1, data), mean_probability(model_2, data))

1×2 Matrix{Float64}:
 895.0  861.851

What?! Now the right model does even worse than the bad model! How? 

Well, our scoring rule doesn't just ignore overconfidence. It flat-out rewards it. Both of the rules we just outlined above are *improper* scoring rules. A scoring rule is called improper if you can improve your score by "cheating" -- that is, by giving the wrong probability on purpose. If you judge models based on the average probability assigned to an event occurring, then you can improve your expected score by assigning a probability of 1 to any event with a probability greater than 0.5, since this will probably earn you more points.

Let's try the Bayes scoring rule. A model gets a score equal to `log(p)`, where `p` is the probability assigned to the event that actually happened. This might seem a bit weird at first, but there's a good reason for it. How surprising is it when you see heads come up twice in two coin flips? We can figure that out by calculating the probability of seeing two heads in a row: a half times a half is a quarter. 

Did you catch that? When we calculate the probability of two observations happening together, we *multiply* them. But when we calculate the total score for a model, we *add* the scores for each observation together. So what we need is a function that turns multiplication into addition. In mathese, we'd say we want `f(prob_1 * prob_2) = f(prob_1) + f(prob_2)`, where `f` is the scoring rule. This function is the logarithm! This is where the log scoring rule comes from.[^1]

That makes some sense, but is the log score a good scoring rule? Let's try it out:

In [6]:
function log_score(predictions, data)
    return sum(
        @. data[:, :is_mean] * log(predictions) + 
        (1 - data[:, :is_mean]) * log(1 - predictions)
    )
end

scores = hcat(log_score(model_1, data), log_score(model_2, data))

1×2 Matrix{Float64}:
 -Inf  -391.124

Wow. That's a pretty big difference -- infinitely big. Any model that assigns a probability of 0 to something that actually happened will immediately get a score of negative infinity, since `log(0) = -Inf`.

This might seem surprising at first, but it's exactly what we want from a scoring rule. When we're comparing models with a scoring rule, the goal is to find the model that best represents reality. If model A says there's a 0% chance of `x`, and then we see `x`, we know for a fact that model A can't be the best one. Model A says something that actually happened was impossible, so we can say with 100% confidence that it doesn't represent reality.

So, the log score looks great! Are there any problems with it? Well, the main one is that it's hard to interpret. A probability makes sense, but what are all these weird log things? We can fix this pretty easily, though. First, we divide the score by the sample size to get an average score. Then we take the exponent of the score. This brings the scores back onto the probability scale, so they fall between 0 and 1.

In [7]:
exp.(scores / n_cats)

1×2 Matrix{Float64}:
 0.0  0.676296

Ahh, much better! The scores look like regular probabilities now. This number is called the geometric mean of the probability, or GMP for short.[^2] Unlike scoring rules like the classification percentage, the GMP works even when the data set is very imbalanced. (i.e. some groups are much bigger than others, the way that we have lots of nice cats in our data set). We can kind of interpret these GMP numbers as saying that, after adjusting for overconfidence, the first model is 0% accurate, while the second is 68% accurate. This is a very mathematically informal way of describing the GMP, but it helps us make it more intuitive.

## My Cat is Continuous-ly Mean to Me

What if the outcome variable isn't just true/false? Cats can't be broken up into two buckets of mean and nice; they fall along a spectrum.

Most frequentist models deal with this by creating a model that generates a single predicted value using something like a linear regression, and then measure how far that prediction is. This is the principle behind statistics like r^2, the mean squared error, or mean absolute errors. 

This has a couple problems. First, it completely neglects uncertainty. Even if model A and model B predict the same *average* value for an observation, the two models might have very different predictions about what the *distribution* of observations will look like. Estimate-based errors can also give bad answers if we have skewed data, multimodal distributions, or otherwise ugly data. For example, comparing models using the mean squared error will generally favor models that underestimate the skew of the data.

Because of this, we're going to stick to working with probabilities instead of messing around with residuals or estimators. Here's the great thing about Bayesian methods: Bayesian methods always return a full probability distribution, not just a single estimate, so we can easily extend the definition of the Bayes score that we used earlier. Unlike in traditional machine learning or frequentist statistics, there's no need to learn a bunch of different objective functions for continuous outcomes.[^3]

## Footnotes

[^1]: There's another sense in which the log score is optimal: No other scoring rule is both local and proper. A local scoring rule is one where a model's score only depends on the data you actually see. With a non-local scoring rule, the score can depend on "imaginary data," i.e. data that we could have observed but didn't. I'll paraphrase a story I heard elsewhere on why we might want this:
> A scientist measures the voltages running through several wires, and finds they range from 88 to 97 volts. A statistician computes an estimated mean of 94 volts from this sample, giving him a statistically significant result. 
Later, the statistician discovers that the voltmeter reads only as far as 100 volts, at which point it will cap out; this implies the population is "censored." Therefore, the sample mean is a biased estimate, because if any of the wires had a voltage of 101 it would have been truncated to 100 volts. The statistician adjusts for this bias and concludes that the estimate  is closer to 96 volts. The whole paper must be rewritten to take this imaginary bias into account; the fact that the voltmeter never actually undercounted any voltages is ignored.

Because the bias of an estimate is not a local scoring rule, trying to pick estimators that minimize the bias will give absurd results like this one quite often.

[^2]: A geometric mean is a type of mean that's designed for numbers that you want to multiply together, instead of adding. The geometric mean of `n` numbers is defined as `(x_1 * x_2 ... * x_n)^(1/n)`. You can use some properties of logarithms to show that this calculation gives the same answer as the one I used to define the GMP.

[^3]: There's only one real difference between the continuous and discrete cases. When we move to continuous outcomes, we have to use the probability *density* instead of the probability of a single outcome. Because of that, the log-score is often called the *log predictive density*, or *LPD* for short. In addition, the GMP no longer represents a probability, because probability densities can exceed 0 or 1; we do not recommend trying to interpret the GMP when working with continuous outcomes.

In [18]:
if isdefined(Main, :TuringTutorials)
    Main.TuringTutorials.tutorial_footer(WEAVE_ARGS[:folder], WEAVE_ARGS[:file])
end