## [The Japanese version of this notebook is version 1.](https://www.kaggle.com/code/yufuin/ai4code-linearity-of-kendall-s-tau-english?scriptVersionId=95794450 "version 1")

In [discussion: About loss function and better score](https://www.kaggle.com/competitions/AI4Code/discussion/325272), it is pointed out that the L1 loss could be effective for this task.  
This follows from a property of Kendall's tau.
# In this notebook, we'll see the linearity of Kendall's tau and why the L1 loss is a sound choice for training the model.

In this competition, we use Kendall's tau as the competition metric.
Kendall's tau measures the ordinal association between two orderings.
Its value ranges from -1 to 1.

Let's assume a sequence of elements in which every pair is in the correct relative order, except for pairs that involve one misaligned element.  
**Then, (1 - Kendall's tau) is proportional to the deviation of the order of the misaligned element (the distance between its predicted and correct positions).**  
Since (1 - Kendall's tau) measures how far a prediction is from the best score, we can call it the true loss of the task.

The assumption here is quite similar to the setting of the competition: the task is to put the markdown cells in the correct positions, while the correct order of the code cells is given.  
**Thus, if we use a model that directly predicts the correct position of the given cells, the absolute difference between the label and the prediction (abs(correct_position - pred_position)), i.e., the L1 loss, should represent the true loss of the task well.**

# Why (1 - Kendall's tau) is proportional to the deviation of the order
The definition of Kendall's tau is as follows:
$$
\begin{aligned}
\tau &= \frac{K-L}{\binom{n}{2}} \\
K &= \#\{(i,j) \mid 1 \le i < j \le n,~~ (O_i - O_j)(O'_i - O'_j) > 0 \} \\
L &= \#\{(i,j) \mid 1 \le i < j \le n,~~ (O_i - O_j)(O'_i - O'_j) \le 0 \}
\end{aligned}
$$

Here, $n$ is the number of elements, and $O_i$ and $O'_i$ are the positions of the $i$-th element in the reference and in the prediction, respectively.  
$K$ and $L$ are the numbers of correctly and incorrectly ordered pairs, respectively.

Now, assume that the predicted ordering has the correct order relation between any two elements, except for pairs involving one element (e.g., the sequence "a b d e <font color="red">c</font> f g h", where the correct order is alphabetical).  
Then, $L$ equals the number of elements whose positions lie between the predicted position and the correct position of the misaligned element (in the example, "d" and "e", thus 2). That is, $L$ equals the deviation of the order of the misaligned element.  
Likewise, $K$ falls below its maximum $\binom{n}{2}$ by the same deviation.
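
To spell out the step, here is a short worked derivation. The symbol $d$ is introduced here only for illustration and denotes the deviation of the order of the misaligned element; it also uses $K + L = \binom{n}{2}$, which holds because every pair is counted in exactly one of $K$ and $L$.

$$
\begin{aligned}
L &= d, \qquad K = \binom{n}{2} - d, \\
\tau &= \frac{K - L}{\binom{n}{2}} = 1 - \frac{2d}{\binom{n}{2}}, \\
1 - \tau &= \frac{2d}{\binom{n}{2}} = \frac{4d}{n(n-1)}.
\end{aligned}
$$

For the example above ($n = 8$, $d = 2$), this gives $1 - \tau = 4/28 = 1/7$.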

**Thus, Kendall's tau falls below 1 (the maximum value) by 2 * the deviation / (nC2). In other words, the true loss of the task is proportional to the deviation of the order of the misaligned element.**
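
The counting argument can be checked by brute force straight from the definition. The following is only a minimal sketch: the function name `kendall_tau_from_definition` and its argument names are made up for illustration (they are not part of the competition code), and it assumes the rankings contain no ties.

```python
from itertools import combinations


def kendall_tau_from_definition(ref_ranks, pred_ranks):
    """Brute-force Kendall's tau; ref_ranks[i] and pred_ranks[i] play the roles of O_i and O'_i."""
    n = len(ref_ranks)
    concordant = 0  # K: pairs ordered the same way in both rankings
    discordant = 0  # L: all remaining pairs
    for i, j in combinations(range(n), 2):
        direction = (ref_ranks[i] - ref_ranks[j]) * (pred_ranks[i] - pred_ranks[j])
        if direction > 0:
            concordant += 1
        else:
            discordant += 1
    n_pairs = n * (n - 1) // 2
    return concordant, discordant, (concordant - discordant) / n_pairs


# Correct order is alphabetical; the prediction is "a b d e c f g h".
reference = list("abcdefgh")
predicted = list("abdecfgh")
ref_ranks = [reference.index(x) for x in reference]
pred_ranks = [predicted.index(x) for x in reference]

K, L, tau = kendall_tau_from_definition(ref_ranks, pred_ranks)
print("K:", K, "L:", L, "tau:", tau, "1 - tau:", 1 - tau)
```

Here only the pairs (c, d) and (c, e) are out of order, so $L$ is 2 and $K$ is 26 out of 28 pairs.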

# Example (supplementary material)
First, let's define helper functions.

In [None]:
# scoring function
# this is copied from https://www.kaggle.com/code/ryanholbrook/competition-metric-kendall-tau-correlation
from bisect import bisect

# Actually O(N^2), but fast in practice for our data
def count_inversions(a):
    inversions = 0
    sorted_so_far = []
    for i, u in enumerate(a):  # O(N)
        j = bisect(sorted_so_far, u)  # O(log N)
        inversions += i - j
        sorted_so_far.insert(j, u)  # O(N)
    return inversions

def kendall_tau(ground_truth, predictions):
    total_inversions = 0  # total inversions in predicted ranks across all instances
    total_2max = 0  # maximum possible inversions across all instances
    for gt, pred in zip(ground_truth, predictions):
        ranks = [gt.index(x) for x in pred]  # rank predicted order in terms of ground truth
        total_inversions += count_inversions(ranks)
        n = len(gt)
        total_2max += n * (n - 1)
    return 1 - 4 * total_inversions / total_2max

In [None]:
def task_loss(ground_truth, predictions):
    return 1 - kendall_tau(ground_truth, predictions)

We'll refer to the deviation of the order of the misaligned element as "diff".  
For example, if the element is misaligned to the 7th position while the correct position is the 2nd, diff=abs(2-7)=5 (obvious!).

In [None]:
def get_diff(source_index, dest_index):
    return abs(source_index - dest_index)

print("diff of 2 => 7 :", get_diff(2,7))

## Let's consider the ordered sequence whose correct ordering is alphabetical.
Suppose we know the correct order of all elements except 'c', so we need to predict the position of 'c'.

In [None]:
correct_sequence = [c for c in "abcdefghij"]
correct_index = 2

print("correct sequence   :", correct_sequence)
print("given base sequence:", correct_sequence[:correct_index] + correct_sequence[correct_index+1:])
print("prediction target  :", correct_sequence[correct_index])
print("correct index      :", correct_index)


## If we predict the position of 'c' as index 3 (0-based), the score is as follows:

In [None]:
def replace(correct_sequence, source_index, dest_index):
    out = list(correct_sequence)
    value = out.pop(source_index)
    out = out[:dest_index] + [value] + out[dest_index:]
    return out

pred_index = 3

diff = get_diff(correct_index, pred_index)
prediction_c_as_3rd = replace(correct_sequence, correct_index, pred_index)

print("pred index         :", pred_index)
print("diff(:=abs(pred_index-correct_index))")
print("                   :", diff)
print("prediction_c_as_3rd:", prediction_c_as_3rd)
print("kendall_tau        :", kendall_tau([correct_sequence], [prediction_c_as_3rd]))
print("task_loss(:=1-tau) :", task_loss([correct_sequence], [prediction_c_as_3rd]))
print("task_loss/diff     :", task_loss([correct_sequence], [prediction_c_as_3rd])/diff)

## When we predict index 4:

In [None]:
pred_index = 4

diff = get_diff(correct_index, pred_index)
prediction_c_as_4th = replace(correct_sequence, correct_index, pred_index)

print("pred index         :", pred_index)
print("diff               :", diff)
print("prediction_c_as_4th:", prediction_c_as_4th)
print("kendall_tau        :", kendall_tau([correct_sequence], [prediction_c_as_4th]))
print("task_loss(:=1-tau) :", task_loss([correct_sequence], [prediction_c_as_4th]))
print("task_loss/diff     :", task_loss([correct_sequence], [prediction_c_as_4th])/diff)

## As above, the loss per unit of deviation (task_loss/diff) is the same in both cases!
In fact, the value is the same for any predicted position.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

CORRECT_SEQUENCE = [c for c in "abcdefghij"]
SOURCE_IDX = 2

def get_scores(source_idx):
    dests = list()
    scores = list()
    losses = list()
    losses_per_deviations = list()

    for dest_idx in range(len(CORRECT_SEQUENCE)):
        # use the function argument rather than the global SOURCE_IDX
        pred = replace(CORRECT_SEQUENCE, source_idx, dest_idx)
        score = kendall_tau([CORRECT_SEQUENCE], [pred])
        loss = task_loss([CORRECT_SEQUENCE], [pred]) # or simply (1 - score)
        loss_per_deviation = loss / max(1, get_diff(source_idx, dest_idx)) # avoid dividing by 0

        dests.append(dest_idx)
        scores.append(score)
        losses.append(loss)
        losses_per_deviations.append(loss_per_deviation)
    return pd.DataFrame({"dest_idx":dests, "kendall's tau":scores, "loss": losses, "loss_per_deviation": losses_per_deviations})

fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(15, 5))
df = get_scores(SOURCE_IDX)
sns.scatterplot(x="dest_idx", y="loss", data=df, ax=ax1)
sns.scatterplot(x="dest_idx", y="kendall's tau", data=df, ax=ax2)
sns.scatterplot(x="dest_idx", y="loss_per_deviation", data=df, ax=ax3)

## We confirmed the linearity of the true loss of the task (1 - Kendall's tau)
(Again, here we assumed the orders of all elements except one are correct.)

Since [the baseline notebook](https://www.kaggle.com/code/aerdem4/ai4code-pytorch-distilbert-baseline) predicts the normalized positions of the markdown cells, the L1 loss should work well for that model.
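
To make the link to normalized positions concrete, here is a rough, self-contained sketch. It assumes that the normalized position is simply `index / n` (this definition and the helper names `count_inversions_bruteforce` and `move_element` are assumptions made for illustration, not taken from the baseline notebook) and that only one element is misaligned. Under these assumptions the ratio between the task loss and the L1 loss should be the same constant, 4 / (n - 1), for every predicted position.

```python
def count_inversions_bruteforce(ranks):
    """Count pairs (i, j) with i < j and ranks[i] > ranks[j]."""
    n = len(ranks)
    return sum(1 for i in range(n) for j in range(i + 1, n) if ranks[i] > ranks[j])


def move_element(sequence, source_index, dest_index):
    """Return a copy of `sequence` with one element moved to `dest_index`."""
    out = list(sequence)
    out.insert(dest_index, out.pop(source_index))
    return out


n = 10
correct = list(range(n))
source_index = 2

ratios = []
for dest_index in range(n):
    if dest_index == source_index:
        continue  # zero deviation: both losses are 0
    pred = move_element(correct, source_index, dest_index)
    # 1 - tau, written in terms of the inversion count
    loss_tau = 4 * count_inversions_bruteforce(pred) / (n * (n - 1))
    # Assumption: normalized position = index / n
    loss_l1 = abs(source_index / n - dest_index / n)
    ratios.append(loss_tau / loss_l1)

print("min ratio:", min(ratios), "max ratio:", max(ratios))
print("4 / (n - 1):", 4 / (n - 1))
```

Note that the constant depends on the sequence length, so notebooks of different lengths weight the same normalized error differently.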

## As shown in the first part of this notebook, the coefficient (loss_per_deviation) should equal $2 ~/~ _n C_2$. Let's check it.

In [None]:
# the following value should match "task_loss/diff" in the above example
def n_C_2(n):
    return (n * (n-1))//2

len_sequence = len(correct_sequence)
decreasing_for_one_rank = 2 / n_C_2(len_sequence)

print("decreasing for one-rank-difference (should be loss/diff):", decreasing_for_one_rank)

This is it!

## If you find this helpful, please upvote💕💕💕
Thank you!