## Spearman Correlation Coefficient

Let's take a moment to build some intuition for the metric used in this competition: the [Spearman correlation coefficient](https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient).

> Spearman's rank correlation coefficient or Spearman's ρ, named after Charles Spearman and often denoted by the Greek letter rho or as $r_s$, is a nonparametric measure of rank correlation (statistical dependence between the rankings of two variables). It assesses how well the relationship between two variables can be described using a monotonic function.

Some notes:
 - It only cares about the rank order of the values, not their magnitudes
 - It ranges from -1 (exactly reversed order) to 1 (identical order)
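
To make both notes concrete, here is a minimal sketch that computes the coefficient as the Pearson correlation of the ranks, which is how Spearman's ρ is defined. The helper name `spearman_via_ranks` is made up for this illustration; `rankdata`, `pearsonr` and `spearmanr` are the usual `scipy.stats` functions.

```python
import numpy as np
from scipy.stats import pearsonr, rankdata, spearmanr


def spearman_via_ranks(a, b):
    """Spearman's rho as the Pearson correlation of the ranks (illustrative helper)."""
    return pearsonr(rankdata(a), rankdata(b))[0]


x = np.array([0., 1., 2., 3.])
increasing = np.exp(x)  # monotonic transform: different values, same order as x
decreasing = -x         # exactly reversed order

rho_inc = spearman_via_ranks(x, increasing)
rho_dec = spearman_via_ranks(x, decreasing)
print(rho_inc, rho_dec)

# scipy's own implementation should agree with the rank-based helper
print(spearmanr(x, increasing)[0], spearmanr(x, decreasing)[0])
```

The increasing transform scores 1 and the reversed sequence scores -1, even though neither shares any values with `x` beyond the starting point.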

In [5]:
import numpy as np
from scipy.stats import spearmanr

In [6]:
def print_spearmanr(a, b):
    v = spearmanr(a, b)
    print(v, v.correlation)

In [8]:
# When sequence order matches perfectly, we get 1.0
a = np.array([0., 1., 2., 3.])
b = np.array([0., 1., 2., 3.])
print_spearmanr(a, b) # --> 1.

SpearmanrResult(correlation=1.0, pvalue=0.0) 1.0


In [9]:
# Even if the values are not the same, as long as the order is correct, we get 1.0
b2 = np.array([4., 5., 6., 7.])
print_spearmanr(a, b2)  # --> 1.

SpearmanrResult(correlation=1.0, pvalue=0.0) 1.0


### What happens when our target (a) has identical values in it?

In [14]:
# What happens when we have ties
a = np.array([0.5, 0.5, 0.7, 0.7])
b = np.array([4., 4.01, 6., 7.])
print_spearmanr(a, b) # --> 0.89

SpearmanrResult(correlation=0.8944271909999159, pvalue=0.10557280900008413) 0.8944271909999159


In [17]:
# What happens when we have ties AND predict the first tied pair as an exact tie
a = np.array([0.5, 0.5, 0.7, 0.7])
b = np.array([4., 4., 6., 7.])
print_spearmanr(a, b) # --> 0.94

SpearmanrResult(correlation=0.9428090415820635, pvalue=0.05719095841793652) 0.9428090415820635


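What's interesting here is that the score changes noticeably depending on whether we predict tied targets as exact ties. In the first case we predicted `4.01` for the second element and in the second case `4.00`. Despite this tiny change in the prediction, the score moves from about 0.89 to about 0.94. Note that the second tied pair in `a` (`0.7`, `0.7`) is still predicted as `6.` and `7.` in both cases, which is why neither score reaches 1.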
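
One way to see where the difference comes from is to look at the ranks directly. The sketch below assumes the default behaviour of `scipy.stats.rankdata`, which gives tied values the average of the ranks they span. The variable names (`b_near`, `b_tied`, `b_both`) are only for this illustration.

```python
import numpy as np
from scipy.stats import rankdata, spearmanr

a = np.array([0.5, 0.5, 0.7, 0.7])     # target with two tied pairs
b_near = np.array([4., 4.01, 6., 7.])  # first pair almost tied, but strictly ordered
b_tied = np.array([4., 4., 6., 7.])    # first pair predicted as an exact tie

ranks_a = rankdata(a)          # ranks 1.5, 1.5, 3.5, 3.5
ranks_near = rankdata(b_near)  # ranks 1, 2, 3, 4
ranks_tied = rankdata(b_tied)  # ranks 1.5, 1.5, 3, 4
print(ranks_a, ranks_near, ranks_tied)

# If both tied pairs are predicted as ties, the ranks match the target exactly
b_both = np.array([4., 4., 6., 6.])
rho_both = spearmanr(a, b_both)[0]
print(rho_both)
```

Because `4.01` gets its own rank while `4.00` shares rank 1.5 with its neighbour, the size of the gap is irrelevant: any nonzero difference breaks the tie.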

This is probably important because many of the values in the training set are identical, so we need a good way to predict exactly equal values wherever the targets are equal.
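
As one possible direction (an assumption, not something verified on the competition data), predictions could be snapped to a small set of discrete levels so that near-equal outputs become exact ties. The helper `snap_to_levels`, the `levels` array and the raw predictions below are all hypothetical; in practice the levels might come from the unique target values in the training set.

```python
import numpy as np
from scipy.stats import spearmanr


def snap_to_levels(preds, levels):
    """Replace each prediction with the nearest value in `levels` (hypothetical helper)."""
    preds = np.asarray(preds, dtype=float)
    levels = np.asarray(levels, dtype=float)
    # Index of the closest level for every prediction
    nearest = np.abs(preds[:, None] - levels[None, :]).argmin(axis=1)
    return levels[nearest]


a = np.array([0.5, 0.5, 0.7, 0.7])            # target with ties
raw_preds = np.array([0.52, 0.49, 0.68, 0.71])  # made-up model outputs
levels = np.array([0.5, 0.7])                  # assumed set of allowed target values

snapped = snap_to_levels(raw_preds, levels)
rho_raw = spearmanr(a, raw_preds)[0]
rho_snapped = spearmanr(a, snapped)[0]
print(snapped, rho_raw, rho_snapped)
```

In this toy case snapping recovers both ties and raises the score. It can also hurt: if every prediction snaps to the same level, the output is constant and the coefficient is undefined, so any such scheme would need checking on validation data.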