# Investigating lwlrap

This competition uses a metric called lwlrap (label-weighted label-ranking average precision). It's described as:

> lwlrap measures the average precision of retrieving a ranked list of relevant labels for each test clip (i.e., the system ranks all the available labels, then the precisions of the ranked lists down to the true label are averaged). This is a generalization of the mean reciprocal rank measure (used in last year's edition of the competition) for the case where there can be multiple true labels per test item. 

> The novel "label-weighted" part means that the overall score is the average over all the labels in the test set, where each label receives equal weight (by contrast, plain lrap gives each test item equal weight, thereby discounting the contribution of individual labels when they appear on the same item as multiple other labels).

> We use label weighting because it allows per-class values to be calculated, and still have the overall metric be expressed as a simple average of the per-class metrics (weighted by each label's prior in the test set).

Below is the starter code from the competition:

In [2]:
import numpy as np
import sklearn.metrics

def calculate_overall_lwlrap_sklearn(truth, scores):
  """Calculate the overall lwlrap using sklearn.metrics.lrap."""
  # sklearn doesn't correctly apply weighting to samples with no labels, so just skip them.
  sample_weight = np.sum(truth > 0, axis=1)
  nonzero_weight_sample_indices = np.flatnonzero(sample_weight > 0)
  overall_lwlrap = sklearn.metrics.label_ranking_average_precision_score(
      truth[nonzero_weight_sample_indices, :] > 0, 
      scores[nonzero_weight_sample_indices, :], 
      sample_weight=sample_weight[nonzero_weight_sample_indices])
  return overall_lwlrap
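
To make the quoted description a bit more concrete, here's a minimal from-scratch sketch of the same idea in plain NumPy. This is not the competition's implementation: the function name `lwlrap_from_scratch` and the toy data are made up for illustration, and it assumes there are no tied scores.

```python
import numpy as np

def lwlrap_from_scratch(truth, scores):
    """Return (overall lwlrap, per-class lwlrap, per-class weight).

    Illustrative sketch only; assumes no tied scores within a sample.
    """
    truth = np.asarray(truth) > 0
    scores = np.asarray(scores, dtype=float)
    num_labels = truth.shape[1]

    # Column indices of each row's labels, from highest score to lowest.
    order = np.argsort(-scores, axis=1, kind="stable")
    truth_sorted = np.take_along_axis(truth, order, axis=1)

    # Precision at every position of the ranked list:
    # (true labels seen so far) / (1-based position in the list).
    hits_so_far = np.cumsum(truth_sorted, axis=1)
    precision_sorted = hits_so_far / np.arange(1, num_labels + 1)

    # Keep the precision only where a true label sits, then scatter it
    # back into the original column (label) order.
    precision = np.zeros_like(scores)
    np.put_along_axis(precision, order, precision_sorted * truth_sorted, axis=1)

    labels_per_class = truth.sum(axis=0)
    per_class = precision.sum(axis=0) / np.maximum(labels_per_class, 1)
    weight = labels_per_class / max(labels_per_class.sum(), 1)
    overall = float(np.sum(per_class * weight))
    return overall, per_class, weight

# Toy data: 2 samples, 3 labels.
toy_truth = np.array([[1, 0, 1],
                      [0, 1, 0]])
toy_scores = np.array([[0.9, 0.8, 0.1],
                       [0.2, 0.3, 0.7]])

overall, per_class, weight = lwlrap_from_scratch(toy_truth, toy_scores)
print("per-class lwlrap:", per_class)   # [1, 0.5, 2/3]
print("overall lwlrap  :", overall)     # 13/18, about 0.7222
```

On this toy data every true label counts once, so the overall score is (1 + 0.5 + 2/3) / 3. Plain lrap would average per sample instead, giving (5/6 + 1/2) / 2 = 2/3, which is the "label-weighted" difference the description is talking about.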

## Building Intuition

Both the sklearn and full versions of `lwlrap` are beyond my understanding. But we can still build intuition about how this metric **feels**. What makes it go up? What makes it go down?

Let's create 100 random samples with 10 labels each and see what a "random" lwlrap looks like:

In [27]:
for _ in range(5):
    # Random test data.
    num_samples = 100
    num_labels = 10
    truth = np.random.rand(num_samples, num_labels) > 0.5
    scores = np.random.rand(num_samples, num_labels)
    print("lwlrap from sklearn.metrics =", calculate_overall_lwlrap_sklearn(truth, scores))

lwlrap from sklearn.metrics = 0.6463328664799254
lwlrap from sklearn.metrics = 0.6320509097948122
lwlrap from sklearn.metrics = 0.6751874304457333
lwlrap from sklearn.metrics = 0.6460547504025762
lwlrap from sklearn.metrics = 0.6078892474169683


So for 100 samples with 10 random labels, we see values around `0.6`-`0.7`.

Let's increase the number of labels to `80` since that's what our competition uses:

In [30]:
for _ in range(5):
    # Random test data.
    num_samples = 100
    num_labels = 80
    truth = np.random.rand(num_samples, num_labels) > 0.5
    scores = np.random.rand(num_samples, num_labels)
    print("lwlrap from sklearn.metrics =", calculate_overall_lwlrap_sklearn(truth, scores))

lwlrap from sklearn.metrics = 0.5361859180299589
lwlrap from sklearn.metrics = 0.5328551826559567
lwlrap from sklearn.metrics = 0.5481219726196813
lwlrap from sklearn.metrics = 0.5186607084213624
lwlrap from sklearn.metrics = 0.5364357582444282


Now we see values closer to `0.5`. 

Let's increase the number of samples to `1,120` because that's how many test items there are in our competition.

In [31]:
for _ in range(5):
    # Random test data.
    num_samples = 1120
    num_labels = 80
    truth = np.random.rand(num_samples, num_labels) > 0.5
    scores = np.random.rand(num_samples, num_labels)
    print("lwlrap from sklearn.metrics =", calculate_overall_lwlrap_sklearn(truth, scores))

lwlrap from sklearn.metrics = 0.5298257037088475
lwlrap from sklearn.metrics = 0.5313353766293242
lwlrap from sklearn.metrics = 0.5284328382110621
lwlrap from sklearn.metrics = 0.5336294318861579
lwlrap from sklearn.metrics = 0.5315035852546283


So the average doesn't move based solely on the number of test items; the values just vary less from run to run.

Right now we're assuming about half the labels are `True` and half are `False`. In our competition usually only 1 or 2 labels are `True`. Let's try to account for that.

In [37]:
for _ in range(5):
    # Random test data.
    num_samples = 1120
    num_labels = 80
    truth = np.random.rand(num_samples, num_labels) < (2/80)
    scores = np.random.rand(num_samples, num_labels)
    print("lwlrap from sklearn.metrics =", calculate_overall_lwlrap_sklearn(truth, scores))

lwlrap from sklearn.metrics = 0.08389730170935047
lwlrap from sklearn.metrics = 0.0857571538950253
lwlrap from sklearn.metrics = 0.08043748764821373
lwlrap from sklearn.metrics = 0.08462396214727662
lwlrap from sklearn.metrics = 0.08541145280375675


Wow. That's a large drop. So if we have completely random scores with sparsely `True` labels, we should expect scores around `0.08`. Our lowest submitted scores were about `0.06`, so this seems about right. Any difference is probably due to my `(2/80)` approximation: each label is drawn independently, so a sample can end up with no true labels at all (those samples are skipped by the function above) or with more than two.
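
To check how much the `(2/80)` approximation matters, here's a rough sketch that gives every sample exactly 1 or 2 true labels instead. The 50/50 split between one and two labels is my assumption, not something taken from the competition data, and the few NumPy lines that compute the score are a stand-in for `calculate_overall_lwlrap_sklearn` (they should agree when there are no tied scores).

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for repeatability
num_samples, num_labels = 1120, 80

# Assumption: each sample has exactly 1 or 2 true labels, equally likely.
truth = np.zeros((num_samples, num_labels), dtype=bool)
for row in truth:  # each row is a view, so this writes into `truth`
    k = rng.integers(1, 3)  # 1 or 2 (upper bound is exclusive)
    row[rng.choice(num_labels, size=k, replace=False)] = True

scores = rng.random((num_samples, num_labels))

# Compact lwlrap: precision at each true label's position in the ranked
# list, averaged over all true labels in the data set.
order = np.argsort(-scores, axis=1)
truth_sorted = np.take_along_axis(truth, order, axis=1)
precision = np.cumsum(truth_sorted, axis=1) / np.arange(1, num_labels + 1)
random_lwlrap = float(precision[truth_sorted].mean())
print("random lwlrap with 1-2 labels per sample =", random_lwlrap)

# Closed form when every sample has exactly one true label: the label is
# equally likely to land at any rank r and contributes 1/r.
single_label_baseline = float(np.sum(1.0 / np.arange(1, num_labels + 1)) / num_labels)
print("expected lwlrap with exactly 1 label     =", single_label_baseline)
```

The single-label closed form works out to about `0.062`, so a random baseline somewhere between that and the `0.08` above looks plausible for data with 1 or 2 labels per sample.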

## Can we influence lwlrap?

Now that we have some intuition about what makes it go up and down, can we increase lwlrap? To make this easier, we'll just look at a single example.

In [182]:
num_samples = 1
num_labels = 10
# This time we'll just set the first two labels to True to make things easier to look at.
truth = np.zeros((num_samples, num_labels), dtype=bool)
truth[0, :2] = True

scores = np.random.rand(num_samples, num_labels)

print("lwlrap from sklearn.metrics =", calculate_overall_lwlrap_sklearn(truth, scores))

lwlrap from sklearn.metrics = 0.19642857142857142


In [183]:
truth

array([[ True,  True, False, False, False, False, False, False, False,
        False]])

In [184]:
scores

array([[0.31620831, 0.73221285, 0.73619461, 0.93265508, 0.96794554,
        0.12174799, 0.14749704, 0.98200028, 0.94924133, 0.98628111]])

In [185]:
np.min(scores), np.max(scores)

(0.12174798641008422, 0.9862811094135545)

In [186]:
temp = scores[0].argsort()
ranks = np.empty_like(temp)
ranks[temp] = np.arange(len(scores[0]))
ranks

array([2, 3, 4, 5, 7, 0, 1, 8, 6, 9])
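
These ranks are enough to reproduce the `0.196...` score by hand. The sketch below redoes the arithmetic with exact fractions; the variable names are mine, and the ranks are copied from the output above (where 0 is the lowest score and 9 the highest).

```python
from fractions import Fraction

num_labels = 10
# Ascending ranks copied from the output above (0 = lowest score).
ascending_ranks = [2, 3, 4, 5, 7, 0, 1, 8, 6, 9]
truth = [True, True] + [False] * 8

# Position in the list sorted from highest score to lowest (1 = top).
positions = [num_labels - r for r in ascending_ranks]

precisions = []
for label, is_true in enumerate(truth):
    if not is_true:
        continue
    # Number of true labels ranked at or above this one,
    # divided by this label's position in the ranked list.
    hits = sum(
        1 for other, other_true in enumerate(truth)
        if other_true and positions[other] <= positions[label]
    )
    precisions.append(Fraction(hits, positions[label]))

score = sum(precisions) / len(precisions)
print(precisions)           # [Fraction(1, 4), Fraction(1, 7)]
print(score, float(score))  # 11/56, about 0.1964
```

Label 0 sits at position 8 with both true labels at or above it (2/8), label 1 sits at position 7 with only itself (1/7), and the mean of the two is 11/56.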

So we have a set of predictions and a given ranking of those predictions. Let's keep the same ranking but make the distance between successive predictions much smaller (0.001).

In [210]:
sameButDiff = np.zeros_like(scores)
for pos, val in enumerate(ranks):    
    sameButDiff[0][pos] = np.min(scores) + (0.001 * val)
    
sameButDiff

array([[0.12374799, 0.12474799, 0.12574799, 0.12674799, 0.12874799,
        0.12174799, 0.12274799, 0.12974799, 0.12774799, 0.13074799]])

In [212]:
print("lwlrap from sklearn.metrics =", calculate_overall_lwlrap_sklearn(truth, sameButDiff))

lwlrap from sklearn.metrics = 0.19642857142857142


So it looks like as long as we keep the order the same, the distance between predictions has no influence over the score.

What happens if we shift the scores to be around 0.5?

In [213]:
sameButDiff = np.zeros_like(scores)
for pos, val in enumerate(ranks):    
    sameButDiff[0][pos] = 0.5 + (0.001 * val)
    
sameButDiff

array([[0.502, 0.503, 0.504, 0.505, 0.507, 0.5  , 0.501, 0.508, 0.506,
        0.509]])

In [214]:
print("lwlrap from sklearn.metrics =", calculate_overall_lwlrap_sklearn(truth, sameButDiff))

lwlrap from sklearn.metrics = 0.19642857142857142


So the absolute position of the scores doesn't matter either. This means that it's **fine** to have an unconfident model as long as it gets the relative order correct: the metric only looks at the ranking, so it's no problem if all our predictions are around 0.5.
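
The two experiments above are special cases of a more general point: any strictly increasing transformation of the scores keeps the ordering, so a rank-based metric can't tell the difference. Here's a small sketch that checks the ordering directly on the scores printed earlier. The particular transforms are arbitrary examples of mine, and the rounding case at the end is included as a caveat, since rounding is not strictly increasing.

```python
import numpy as np

# Scores copied from the single-example output above.
scores = np.array([0.31620831, 0.73221285, 0.73619461, 0.93265508, 0.96794554,
                   0.12174799, 0.14749704, 0.98200028, 0.94924133, 0.98628111])

# A few strictly increasing transforms (arbitrary illustrative choices).
transforms = {
    "squash toward 0.5": lambda s: 0.5 + 0.01 * (s - 0.5),
    "logit": lambda s: np.log(s / (1 - s)),
    "cube": lambda s: s ** 3,
    "shift and scale": lambda s: 100 * s - 7,
}

baseline_order = np.argsort(scores)
preserved = {}
for name, transform in transforms.items():
    preserved[name] = bool(np.array_equal(np.argsort(transform(scores)), baseline_order))
    print(f"{name}: ordering preserved = {preserved[name]}")

# Rounding collapses nearby scores into ties, so the ranking loses information.
rounded = np.round(scores, 1)
num_distinct = len(np.unique(rounded))
print("distinct values after rounding:", num_distinct, "of", len(scores))
```

As long as the ordering is preserved, the score stays the same. Once scores collapse into ties, as in the rounding case, that guarantee no longer holds.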

This is useful knowledge for us, but probably bad for the competition owners. If they want to build a model for out of sample data, having a model that predi