Source : [Kendall rank correlation coefficient](https://en.wikipedia.org/wiki/Kendall_rank_correlation_coefficient)

In statistics, the Kendall rank correlation coefficient, commonly referred to as Kendall's tau coefficient (after the Greek letter τ), is a statistic used to measure the ordinal association between two measured quantities. A tau test is a non-parametric hypothesis test for statistical dependence based on the tau coefficient.

It is a measure of rank correlation: the similarity of the orderings of the data when ranked by each of the quantities. It is named after Maurice Kendall, who developed it in 1938,[1] though Gustav Fechner had proposed a similar measure in the context of time series in 1897.[2]

## Mathematical definition

Let $(x_1, y_1)$, $(x_2, y_2)$, $\ldots$, $(x_n, y_n)$ be a set of observations of the joint random variables $X$ and $Y$, such that all the values of $x_i$ and all the values of $y_i$ are unique. Any pair of observations $(x_i, y_i)$ and $(x_j, y_j)$, where $i \neq j$, is said to be concordant if the ranks for both elements agree: that is, if both $x_i > x_j$ and $y_i > y_j$, or if both $x_i < x_j$ and $y_i < y_j$. The pair is said to be discordant if $x_i > x_j$ and $y_i < y_j$, or if $x_i < x_j$ and $y_i > y_j$. If $x_i = x_j$ or $y_i = y_j$, the pair is neither concordant nor discordant.

The Kendall τ coefficient is defined as:

$
    \tau = \frac{(\text{number of concordant pairs}) - (\text{number of discordant pairs})}{n (n-1) /2}
$
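
The definition can be checked with a few lines of plain Python. The following is a minimal sketch, not SciPy's implementation: the function name `kendall_tau` is an illustrative assumption, and the code assumes there are no ties, as in the definition above (tied pairs are simply skipped).

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall's tau for two equal-length sequences (hypothetical helper, no tie correction)."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    n = len(x)
    if n < 2:
        raise ValueError("at least two observations are required")
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        # The sign of the product tells whether x and y order the pair the same way.
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
        # sign == 0 is a tie: neither concordant nor discordant.
    return (concordant - discordant) / (n * (n - 1) / 2)


print(kendall_tau([1, 2, 3, 4, 5], [1, 2, 3, 4, 5]))  # 1.0
print(kendall_tau([1, 2, 3, 4, 5], [5, 4, 3, 2, 1]))  # -1.0
```

The double loop is quadratic in $n$, which is fine for small samples; library implementations typically use faster algorithms.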


## Implementation

SciPy : [Kendall rank on Python](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.mstats.kendalltau.html)

### scipy.stats.mstats.kendalltau


```python
scipy.stats.mstats.kendalltau(x, y, use_ties=True, use_missing=False)
```
Computes Kendall’s rank correlation tau on two variables x and y.

* Parameters:	

    * **x** : sequence

        First data list (for example, time).

    * **y** : sequence

        Second data list.

    * **use_ties** : {True, False}, optional

        Whether ties correction should be performed.

    * **use_missing** : {False, True}, optional

        Whether missing data should be allocated a rank of 0 (False) or the average rank (True).

* Returns:	

    * **correlation** : float

        Kendall tau

    * **pvalue** : float

        Approximate two-sided p-value.


# Kendall tau distance

Source : [Kendall tau distance](https://en.wikipedia.org/wiki/Kendall_tau_distance)

The Kendall tau rank distance is a metric that counts the number of pairwise disagreements between two ranking lists. The larger the distance, the more dissimilar the two lists are. Kendall tau distance is also called **bubble-sort distance** since it is equivalent to the number of swaps that the bubble sort algorithm would make to place one list in the same order as the other list. The Kendall tau distance was created by Maurice Kendall.
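
The bubble-sort interpretation can be illustrated directly. The sketch below is an assumption-laden illustration rather than a reference implementation: the name `bubble_sort_distance` is hypothetical, and both lists are assumed to contain the same unique elements.

```python
def bubble_sort_distance(list_a, list_b):
    """Count the adjacent swaps bubble sort makes to put list_a into list_b's order."""
    # Where each element sits in the target order (elements assumed unique).
    target_pos = {item: k for k, item in enumerate(list_b)}
    seq = [target_pos[item] for item in list_a]
    swaps = 0
    for end in range(len(seq) - 1, 0, -1):
        for k in range(end):
            if seq[k] > seq[k + 1]:
                # Each swap repairs exactly one pair that is out of order.
                seq[k], seq[k + 1] = seq[k + 1], seq[k]
                swaps += 1
    return swaps


print(bubble_sort_distance(["a", "b", "c"], ["a", "b", "c"]))  # 0
print(bubble_sort_distance(["a", "b", "c"], ["c", "b", "a"]))  # 3
```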

## Definition

The Kendall tau ranking distance between two lists $L1$ and $L2$ is

$
    K(\tau_1,\tau_2) = |\{(i,j): i < j, ( \tau_1(i) < \tau_1(j) \wedge \tau_2(i) > \tau_2(j) ) \vee ( \tau_1(i) > \tau_1(j) \wedge \tau_2(i) < \tau_2(j) )\}|.
$

where $\tau_1(i)$ and $\tau_2(i)$ are the rankings of the element $i$ in $L1$ and $L2$ respectively.


$K(\tau_1,\tau_2)$ will be equal to $0$ if the two lists are identical and $n(n-1)/2$ (where $n$ is the list size) if one list is the reverse of the other. Often Kendall tau distance is normalized by dividing by $n(n-1)/2$ so a value of 1 indicates maximum disagreement. The normalized Kendall tau distance therefore lies in the interval [0,1].
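
A direct translation of this definition into Python might look as follows. This is a sketch under stated assumptions: the function name `kendall_tau_distance` and the `normalize` flag are illustrative, and each ranking is represented as a dictionary mapping an element to its rank (playing the role of $\tau_1$ and $\tau_2$).

```python
from itertools import combinations


def kendall_tau_distance(rank1, rank2, normalize=False):
    """Number of element pairs that the two rankings order differently."""
    if rank1.keys() != rank2.keys():
        raise ValueError("both rankings must cover the same elements")
    discordant = sum(
        1
        for i, j in combinations(rank1, 2)
        # A negative product means the pair is ordered oppositely in the two rankings.
        if (rank1[i] - rank1[j]) * (rank2[i] - rank2[j]) < 0
    )
    if not normalize:
        return discordant
    n = len(rank1)
    if n < 2:
        return 0.0  # no pairs to disagree on
    return discordant / (n * (n - 1) / 2)


forward = {"a": 1, "b": 2, "c": 3, "d": 4}
backward = {"a": 4, "b": 3, "c": 2, "d": 1}
print(kendall_tau_distance(forward, forward))                   # 0
print(kendall_tau_distance(forward, backward))                  # 6
print(kendall_tau_distance(forward, backward, normalize=True))  # 1.0
```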

Kendall tau distance may also be defined as

$
    K(\tau_1,\tau_2) = \sum_{\{i,j\}\in P} \bar{K}_{i,j}(\tau_1,\tau_2)
$

where

* $P$ is the set of unordered pairs of distinct elements in $\tau_1$ and $\tau_2$
* $\bar{K}_{i,j}(\tau_1,\tau_2) = 0$ if $i$ and $j$ are in the same order in $\tau_1$ and $\tau_2$
* $\bar{K}_{i,j}(\tau_1,\tau_2) = 1$ if $i$ and $j$ are in the opposite order in $\tau_1$ and $\tau_2$.

Kendall tau distance can also be defined as the total number of discordant pairs.

Kendall tau distance in Rankings: A permutation (or ranking) is an array of $N$ integers where each of the integers between 0 and $N-1$ appears exactly once. The Kendall tau distance between two rankings is the number of pairs that are in different order in the two rankings. For example, the Kendall tau distance between [0 3 1 6 2 5 4] and [1 0 3 6 4 2 5] is four because the pairs [0-1], [3-1], [2-4], [5-4] are in different order in the two rankings, but all other pairs are in the same order.
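
The permutation example can be reproduced in code. The sketch below swaps in a different technique from the pairwise count: it relabels the first permutation by positions in the second and then counts inversions with merge sort, which takes $O(N \log N)$ time. The function names are illustrative assumptions, and both inputs are assumed to be permutations of the integers 0 to $N-1$.

```python
def count_inversions(seq):
    """Return (sorted copy, number of inversions) using merge sort."""
    if len(seq) <= 1:
        return list(seq), 0
    mid = len(seq) // 2
    left, inv_left = count_inversions(seq[:mid])
    right, inv_right = count_inversions(seq[mid:])
    merged, inversions = [], inv_left + inv_right
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            # right[j] precedes every remaining element of left: one inversion each.
            merged.append(right[j])
            inversions += len(left) - i
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged, inversions


def permutation_distance(a, b):
    """Kendall tau distance between two permutations of 0..N-1."""
    position_in_b = [0] * len(b)
    for index, value in enumerate(b):
        position_in_b[value] = index
    # After relabelling, b is the identity and the disagreeing pairs are the inversions.
    relabelled = [position_in_b[value] for value in a]
    return count_inversions(relabelled)[1]


print(permutation_distance([0, 3, 1, 6, 2, 5, 4], [1, 0, 3, 6, 4, 2, 5]))  # 4
```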

If the Kendall tau function is applied directly to the lists, as $K(L1,L2)$, instead of to their rankings, as $K(\tau_1,\tau_2)$ (where $\tau_1$ and $\tau_2$ are the rankings of the elements of $L1$ and $L2$ respectively), then the triangle inequality is not guaranteed. The triangle inequality fails in cases where there are repetitions in the lists, so the function is then no longer a metric.
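
A small counterexample makes this concrete. The sketch below is one possible reading of applying the count directly to lists: the hypothetical helper `raw_distance` counts strictly discordant pairs and treats tied pairs as not discordant. The three lists are made-up illustration data.

```python
from itertools import combinations


def raw_distance(list1, list2):
    """Count strictly discordant pairs on raw lists that may contain repeated values."""
    return sum(
        1
        for i, j in combinations(range(len(list1)), 2)
        if (list1[i] - list1[j]) * (list2[i] - list2[j]) < 0
    )


a = [1, 2]
b = [1, 1]  # repeated value: its only pair is tied, so it is never discordant
c = [2, 1]

print(raw_distance(a, c))                       # 1
print(raw_distance(a, b) + raw_distance(b, c))  # 0
```

Here the distance from `a` to `c` is larger than the sum of the distances through `b`, which violates the triangle inequality. Note also that `a` and `b` differ yet have distance 0.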

In [1]:
%matplotlib inline
%matplotlib notebook

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('ggplot')

from analyser import *

import psycopg2  # explicit import in case `analyser` does not re-export it
con = psycopg2.connect("dbname=test1")

# Example

Suppose we rank a group of five people by height and by weight:

In [9]:
# Example
persons = ["A","B","C","D","E"]
data = [[1,2,3,4,5],
        [3,4,1,2,5]]

idx = ["RankbyHeight", "RankbyWeight"]

df = pd.DataFrame(data, index=idx, columns=persons)
df

| | A | B | C | D | E |
|---|---|---|---|---|---|
| RankbyHeight | 1 | 2 | 3 | 4 | 5 |
| RankbyWeight | 3 | 4 | 1 | 2 | 5 |


Here person A is tallest and third-heaviest, and so on.

In order to calculate the Kendall tau distance, pair each person with every other person and count the number of times the values in list 1 are in the opposite order of the values in list 2.

In [12]:
cols = ["Pair","Height","Weight","Count"]

data = [["(A,B)","1 < 2","3 < 4",""],
        ["(A,C)","1 < 3","3 > 1","X"],
        ["(A,D)","1 < 4","3 > 2","X"],
        ["(A,E)","1 < 5","3 < 5",""],
        ["(B,C)","2 < 3","4 > 1","X"],
        ["(B,D)","2 < 4","4 > 2","X"],
        ["(B,E)","2 < 5","4 < 5",""],
        ["(C,D)","3 < 4","1 < 2",""],
        ["(C,E)","3 < 5","1 < 5",""],
        ["(D,E)","4 < 5","2 < 5",""],]

df2 = pd.DataFrame(data, columns=cols)

In [13]:
df2

| | Pair | Height | Weight | Count |
|---|---|---|---|---|
| 0 | (A,B) | 1 < 2 | 3 < 4 | |
| 1 | (A,C) | 1 < 3 | 3 > 1 | X |
| 2 | (A,D) | 1 < 4 | 3 > 2 | X |
| 3 | (A,E) | 1 < 5 | 3 < 5 | |
| 4 | (B,C) | 2 < 3 | 4 > 1 | X |
| 5 | (B,D) | 2 < 4 | 4 > 2 | X |
| 6 | (B,E) | 2 < 5 | 4 < 5 | |
| 7 | (C,D) | 3 < 4 | 1 < 2 | |
| 8 | (C,E) | 3 < 5 | 1 < 5 | |
| 9 | (D,E) | 4 < 5 | 2 < 5 | |


Since there are 4 pairs whose values are in opposite order, the Kendall tau distance is 4. The normalized Kendall tau distance is

$
    \frac{4}{5(5 - 1)/2} = 0.4
$

A value of $0.4$ indicates that $40\%$ of pairs differ in ordering between the two lists.
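
As a cross-check, the normalized distance can be related back to the τ coefficient defined earlier. This is a short derivation under the assumption that there are no ties, with $C$ and $D$ (names introduced here for illustration) denoting the numbers of concordant and discordant pairs, so that $C + D = n(n-1)/2$ and $D$ is the Kendall tau distance:

$
    \tau = \frac{C - D}{n(n-1)/2} = \frac{\bigl(n(n-1)/2 - D\bigr) - D}{n(n-1)/2} = 1 - 2\,\frac{D}{n(n-1)/2}
$

For the height and weight rankings this gives $\tau = 1 - 2 \times 0.4 = 0.2$.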

In [109]:
from scipy.stats.mstats import kendalltau

In [110]:
kendalltau(df.T.RankbyHeight, df.T.RankbyWeight)

KendalltauResult(correlation=0.20000000000000001, pvalue=0.62420611476640597)

In [111]:
kendalltau(df.T.RankbyHeight, df.T.RankbyWeight, use_ties=False)

KendalltauResult(correlation=0.20000000000000001, pvalue=0.62420611476640597)

In [112]:
kendalltau(df.T.RankbyWeight, df.T.RankbyHeight)

KendalltauResult(correlation=0.20000000000000001, pvalue=0.62420611476640597)

In [113]:
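# Note: this compares each person's own two ranks element-wise, not pairs of persons,
# so the 0.2 obtained from it only coincides with tau by chance for this data.
s = df.T.RankbyHeight > df.T.RankbyWeight
s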

A    False
B    False
C     True
D     True
E    False
dtype: bool

In [114]:
a = s.sum()
n = s.size

In [115]:
float(a) / (n*(n-1) / 2)

0.2

In [116]:
list(zip(persons, persons))  # list() so the pairs are also displayed on Python 3

[('A', 'A'), ('B', 'B'), ('C', 'C'), ('D', 'D'), ('E', 'E')]

In [64]:
from itertools import combinations
l = list(combinations(persons, 2))  # all unordered pairs of persons

In [66]:
df

| | A | B | C | D | E |
|---|---|---|---|---|---|
| RankbyHeight | 1 | 2 | 3 | 4 | 5 |
| RankbyWeight | 3 | 4 | 1 | 2 | 5 |


In [122]:
comp = [df[a] < df[b] for a,b in l]
test = pd.DataFrame(comp)
res = test.join(pd.Series(l, name="Pairs"))
res

| | RankbyHeight | RankbyWeight | Pairs |
|---|---|---|---|
| 0 | True | True | (A, B) |
| 1 | True | False | (A, C) |
| 2 | True | False | (A, D) |
| 3 | True | True | (A, E) |
| 4 | True | False | (B, C) |
| 5 | True | False | (B, D) |
| 6 | True | True | (B, E) |
| 7 | True | True | (C, D) |
| 8 | True | True | (C, E) |
| 9 | True | True | (D, E) |


In [119]:
s = res.RankbyHeight != res.RankbyWeight
s

0    False
1     True
2     True
3    False
4     True
5     True
6    False
7    False
8    False
9    False
dtype: bool

In [120]:
a = s.sum()
n = df.columns.size
float(a) / (n*(n-1) / 2)

0.4

In [125]:
from scipy.stats.mstats import kendalltau_seasonal
kendalltau_seasonal(df)

{'chi2 total': 0.0,
 'chi2 trend': 0.0,
 'global p-value (dep)': nan,
 'global p-value (indep)': 1.0,
 'global tau': 0.0,
 'global tau (alt)': 0.0,
 'seasonal p-value': masked_array(data = [1.0 1.0 1.0 1.0 --],
              mask = [False False False False  True],
        fill_value = 1e+20),
 'seasonal tau': masked_array(data = [1.0 1.0 -1.0 -1.0 --],
              mask = [False False False False  True],
        fill_value = 1e+20)}

In [129]:
kendalltau_seasonal(df.values)

{'chi2 total': 0.0,
 'chi2 trend': 0.0,
 'global p-value (dep)': nan,
 'global p-value (indep)': 1.0,
 'global tau': 0.0,
 'global tau (alt)': 0.0,
 'seasonal p-value': masked_array(data = [1.0 1.0 1.0 1.0 --],
              mask = [False False False False  True],
        fill_value = 1e+20),
 'seasonal tau': masked_array(data = [1.0 1.0 -1.0 -1.0 --],
              mask = [False False False False  True],
        fill_value = 1e+20)}