# Kendall's Tau Correlation <br>
Kendall's tau is a measure of [rank correlation](https://en.wikipedia.org/wiki/Kendall_rank_correlation_coefficient): the similarity of the orderings of the data when ranked by each of the quantities.
A τ test is a non-parametric hypothesis test for statistical dependence based on the τ coefficient. <br>

The definition of Kendall’s tau that is used is:

$\tau = \frac{P - Q}{\sqrt{(P + Q + T)(P + Q + U)}}$ <br>

where P is the number of concordant pairs, Q the number of discordant pairs, T the number of ties only in x, and U the number of ties only in y. If a tie occurs for the same pair in both x and y, it is not added to either T or U.

Implementation in Python: [link](https://docs.scipy.org/doc/scipy-0.15.1/reference/generated/scipy.stats.kendalltau.html)

## Create Ranking Dataframe 

In [4]:
import pandas as pd
import scipy.stats as stats


In [12]:
# Twitter most positive rank dataframe
twitter_pos = {'brands': ['Nike', 'Adidas', 'Converse', 'Reebok','New balance'], 
                'rank': [3, 4, 2, 5, 1],
                'pos_pct' : [0.2093, 0.2023, 0.2246, 0.1085, 0.2391]}
twitter_pos = pd.DataFrame(data = twitter_pos)
twitter_pos

Unnamed: 0,brands,rank,pos_pct
0,Nike,3,0.2093
1,Adidas,4,0.2023
2,Converse,2,0.2246
3,Reebok,5,0.1085
4,New balance,1,0.2391
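The `rank` column above is typed in by hand. As a sketch of an alternative, the same ranks can be derived from `pos_pct` with pandas, which avoids transcription mistakes. The variable name `derived` is hypothetical, and ranking the highest percentage as 1 is an assumption inferred from the table above.

```python
import pandas as pd

# Same brands and percentages as the Twitter positive table above
derived = pd.DataFrame({
    'brands': ['Nike', 'Adidas', 'Converse', 'Reebok', 'New balance'],
    'pos_pct': [0.2093, 0.2023, 0.2246, 0.1085, 0.2391],
})

# Highest percentage gets rank 1; method='min' gives tied values the same (lowest) rank
derived['rank'] = derived['pos_pct'].rank(ascending=False, method='min').astype(int)

print(derived['rank'].tolist())  # [3, 4, 2, 5, 1]
```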


In [11]:
# Twitter most negative rank dataframe
twitter_neg = {'brands': ['Nike', 'Adidas', 'Converse', 'Reebok','New balance'], 
                'rank': [2, 3, 1, 5, 4],
                'neg_pct' : [0.1377, 0.1301, 0.1833, 0.528, 0.1087]}
twitter_neg = pd.DataFrame(data = twitter_neg)
twitter_neg

Unnamed: 0,brands,rank,neg_pct
0,Nike,2,0.1377
1,Adidas,3,0.1301
2,Converse,1,0.1833
3,Reebok,5,0.528
4,New balance,4,0.1087


In [10]:
# Reddit most positive rank dataframe
reddit_pos = {'brands': ['Nike', 'Converse','New balance'], 
                'rank': [2, 3, 1],
                'pos_pct' : [0.1558, 0.1176, 0.2222]}
reddit_pos = pd.DataFrame(data = reddit_pos)
reddit_pos

Unnamed: 0,brands,rank,pos_pct
0,Nike,2,0.1558
1,Converse,3,0.1176
2,New balance,1,0.2222


In [13]:
# Reddit most negative rank dataframe
reddit_neg = {'brands': ['Nike', 'Converse','New balance'], 
                'rank': [3, 2, 1],
                'neg_pct' : [0.0519, 0.1176, 0.2222]}
reddit_neg = pd.DataFrame(data = reddit_neg)
reddit_neg

Unnamed: 0,brands,rank,neg_pct
0,Nike,3,0.0519
1,Converse,2,0.1176
2,New balance,1,0.2222


In [2]:
# Survey data most positive rank dataframe
survey_pos = {'brands': ['Nike', 'Adidas', 'Converse', 'Reebok','New balance'], 
                'rank': [1, 2, 5, 4, 3],
                'pos_pct' : [0.6828, 0.6618, 0.5593, 0.5773, 0.6412]}
survey_pos = pd.DataFrame(data = survey_pos)
survey_pos

Unnamed: 0,brands,rank,pos_pct
0,Nike,1,0.6828
1,Adidas,2,0.6618
2,Converse,5,0.5593
3,Reebok,4,0.5773
4,New balance,3,0.6412


In [3]:
# Survey data most negative rank dataframe
survey_neg = {'brands': ['Nike', 'Adidas', 'Converse', 'Reebok','New balance'], 
                'rank': [4, 5, 2, 1, 3],
                'neg_pct' : [0.0848, 0.0793, 0.1130, 0.1366, 0.1029]}
survey_neg = pd.DataFrame(data = survey_neg)
survey_neg

Unnamed: 0,brands,rank,neg_pct
0,Nike,4,0.0848
1,Adidas,5,0.0793
2,Converse,2,0.113
3,Reebok,1,0.1366
4,New balance,3,0.1029


## Calculate Kendall's Tau Correlation

In [31]:
# calculate the Twitter and survey brand positive sentiment rank correlation
x1 = twitter_pos['rank']
x2 = survey_pos['rank']
tau, p_value = stats.kendalltau(x1, x2)
print(f'tau is {tau:.2f}, p value is {p_value:.2f}')

tau is 0.00, p value is 1.00


In [32]:
# calculate the Twitter and survey brand negative sentiment rank correlation
x1 = twitter_neg['rank']
x2 = survey_neg['rank']
tau, p_value = stats.kendalltau(x1, x2)
print(f'tau is {tau:.2f}, p value is {p_value:.2f}')

tau is -0.20, p value is 0.82


In [33]:
# calculate the Reddit and survey brand positive sentiment rank correlation
# keep only the survey rows for brands that also appear in the Reddit data
x1 = survey_pos.loc[survey_pos['brands'].isin(reddit_pos['brands']), 'rank']
x2 = reddit_pos['rank']
tau, p_value = stats.kendalltau(x1, x2)
print(f'tau is {tau:.2f}, p value is {p_value:.2f}')

tau is 0.33, p value is 1.00
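The comparison above works because the filtered survey rows and the Reddit rows happen to list the brands in the same order; `kendalltau` pairs values by position, not by brand name. A minimal sketch of a more defensive approach is to merge the two tables on `brands` first. The name `paired` and the `_survey` / `_reddit` suffixes are hypothetical choices for illustration.

```python
import pandas as pd
import scipy.stats as stats

# Rank data copied from the survey and Reddit positive tables above
survey_pos = pd.DataFrame({
    'brands': ['Nike', 'Adidas', 'Converse', 'Reebok', 'New balance'],
    'rank': [1, 2, 5, 4, 3],
})
reddit_pos = pd.DataFrame({
    'brands': ['Nike', 'Converse', 'New balance'],
    'rank': [2, 3, 1],
})

# Inner join keeps only brands present in both sources and aligns them by name
paired = survey_pos.merge(reddit_pos, on='brands', suffixes=('_survey', '_reddit'))

tau, p_value = stats.kendalltau(paired['rank_survey'], paired['rank_reddit'])
print(f'tau is {tau:.2f}')  # tau is 0.33
```

The survey ranks for the three shared brands are 1, 5 and 3 rather than 1, 2 and 3. That does not affect the result, because Kendall's tau depends only on the relative order of the values.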


In [34]:
# calculate the Reddit and survey brand negative sentiment rank correlation
# keep only the survey rows for brands that also appear in the Reddit data
x1 = survey_neg.loc[survey_neg['brands'].isin(reddit_neg['brands']), 'rank']
x2 = reddit_neg['rank']
tau, p_value = stats.kendalltau(x1, x2)
print(f'tau is {tau:.2f}, p value is {p_value:.2f}')

tau is 0.33, p value is 1.00


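__*Conclusion*__: <br>
At a significance level of 0.05, every `p value` above (0.82 to 1.00) is far larger than the threshold, so we cannot reject H0, the null hypothesis that the two rankings are independent. This is not the conclusion we hoped for: none of the comparisons shows statistically significant agreement between the social media rankings and the survey rankings.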
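One likely reason for the large p-values is the tiny sample size. The sketch below enumerates every possible ordering of one ranking to get an exact two-sided p-value, assuming no ties. The helper names (`kendall_score`, `exact_two_sided_p`, `smallest_possible_p`) are hypothetical, and this brute-force enumeration is only an illustration, not necessarily the method `scipy` uses.

```python
from itertools import combinations, permutations


def kendall_score(x, y):
    """Return S = concordant pairs minus discordant pairs (assumes no ties)."""
    s = 0
    for i, j in combinations(range(len(x)), 2):
        s += 1 if (x[i] - x[j]) * (y[i] - y[j]) > 0 else -1
    return s


def exact_two_sided_p(x, y):
    """Share of all orderings of y whose |S| is at least the observed |S|."""
    observed = abs(kendall_score(x, y))
    scores = [abs(kendall_score(x, perm)) for perm in permutations(y)]
    return sum(s >= observed for s in scores) / len(scores)


def smallest_possible_p(n):
    """p-value when two rankings of n items agree perfectly."""
    ranks = list(range(1, n + 1))
    return exact_two_sided_p(ranks, ranks)


print(f'{smallest_possible_p(3):.3f}')  # 0.333
print(f'{smallest_possible_p(5):.3f}')  # 0.017

# Twitter vs survey negative ranks from the tables above
print(f'{exact_two_sided_p([2, 3, 1, 5, 4], [4, 5, 2, 1, 3]):.2f}')  # 0.82
```

With only three brands, even perfect agreement gives a p-value of about 0.33, so the Reddit comparisons could never reach the 0.05 level. With five brands, only perfect agreement or perfect reversal would.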

# Spearman Rank Correlation <br>
[Reference](https://pythonfordatascienceorg.wordpress.com/correlation-python/#correlation-kendalltau)