In [None]:
import pandas as pd
import plotly.express as px

# What is Spearman's rank correlation coefficient?

According to [Wikipedia's definition](https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient):

In statistics, Spearman's rank correlation coefficient or Spearman's ρ, named after Charles Spearman and often denoted by the Greek letter $\rho$  (rho) or as $r_{s}$, is a **nonparametric measure of rank correlation** (statistical dependence between the rankings of two variables). It assesses **how well the relationship between two variables can be described using a monotonic function.**

The Spearman correlation between two variables is equal to the Pearson correlation between the rank values of those two variables; while Pearson's correlation assesses linear relationships, Spearman's correlation assesses **monotonic relationships** (whether linear or not). If there are no repeated data values, a **perfect Spearman correlation of +1 or −1 occurs when each of the variables is a perfect monotone function of the other.**
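To make the difference between linear and monotonic relationships concrete, here is a minimal sketch using made-up data (the values of `x` and the cubic relationship are assumptions chosen for illustration, not part of the tutorial's dataset). Because `y` always increases with `x`, the Pearson correlation of the ranks is 1, while the Pearson correlation of the raw values stays below 1.

```python
import pandas as pd

# Made-up illustrative data: y = x**3 is monotonic in x but not linear
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = x ** 3

# Pearson on the raw values measures linear association
pearson_raw = x.corr(y)

# Pearson on the ranks is Spearman's coefficient
spearman_via_ranks = x.rank().corr(y.rank())

print(f"Pearson on raw values:       {pearson_raw:.4f}")
print(f"Pearson on ranks (Spearman): {spearman_via_ranks:.4f}")
```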

Intuitively, the Spearman correlation between two variables will be **high** when observations have a **similar** (or identical for a correlation of 1) rank (i.e. relative position label of the observations within the variable: 1st, 2nd, 3rd, etc.) between the two variables, and **low** when observations have a **dissimilar** (or fully opposed for a correlation of −1) rank between the two variables.

Spearman's coefficient is appropriate for **both continuous and discrete ordinal variables**.

# Formula

$${r_{s}=1-{\frac {6\sum d_{i}^{2}}{n(n^{2}-1)}}}$$

Where:
- $d_{i} = R(X_{i}) - R(Y_{i})$ is the difference between the two ranks of each observation.
- $n$ is the number of observations.
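Note that this shortcut formula is exact only when all ranks are distinct. With tied values, the general definition (Pearson correlation of the ranks) should be used instead. The sketch below uses a small made-up dataset with one tie to show that the two can disagree; the data and variable names are illustrative assumptions.

```python
import pandas as pd

# Made-up data with one tie in x (the value 2 appears twice)
x = pd.Series([1, 2, 2, 3])
y = pd.Series([1, 2, 3, 4])

rx, ry = x.rank(), y.rank()  # tied values share the average rank (2.5)

# Shortcut formula based on squared rank differences
n = len(x)
d2_sum = ((rx - ry) ** 2).sum()
shortcut = 1 - 6 * d2_sum / (n * (n**2 - 1))

# General definition: Pearson correlation of the ranks
pearson_on_ranks = rx.corr(ry)

print(f"shortcut formula: {shortcut:.4f}")          # 0.9500
print(f"Pearson on ranks: {pearson_on_ranks:.4f}")  # 0.9487
```

The store data used in this tutorial has no ties, so the shortcut formula is safe there.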


# Practical example

Let's walk through a practical example to see how the coefficient is calculated. The data for this tutorial is taken from Eugene O'Loughlin's [video](https://www.youtube.com/watch?v=DE58QuNKA-c).

Let's say we have gathered data from 10 stores. For each store we record the `distance` to the city center and the `price` of a product, and we want to see how the two variables are correlated.

In [None]:
data = {
    "store": [1,2,3,4,5,6,7,8,9,10],
    "distance": [50,175,250,375,425,585,720,810,875,950],
    "price": [1.8,1.25,2,1,1.1,1.2,0.8,0.6,1.05,0.85]
}
df = pd.DataFrame(data)
df

In [None]:
# Scatter plot of price against distance
fig = px.scatter(df, x="distance", y="price")
fig.update_traces(marker=dict(size=10, color="LightSkyBlue"))
fig.update_layout(template="plotly_dark")
fig.show()

# 1. Rank columns

We first need to rank each column. With the default settings, the smallest value gets rank 1 and the largest value gets rank $n$; tied values receive the average of the ranks they span. We can do this with the pandas `Series.rank()` method.

In [None]:
df["distance_rank"] = df["distance"].rank()
df["price_rank"] = df["price"].rank()
df
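To see how `rank()` orders values and handles ties, here is a small sketch on a made-up Series (the values are an assumption for illustration only). It shows the default ascending ranking with averaged ties, plus the `ascending` and `method` parameters.

```python
import pandas as pd

s = pd.Series([10, 30, 20, 30])  # made-up values; 30 appears twice

default_ranks = s.rank().tolist()                    # smallest value gets rank 1, ties averaged
descending_ranks = s.rank(ascending=False).tolist()  # largest value gets rank 1
min_ranks = s.rank(method="min").tolist()            # ties share the lowest rank in the group

print(default_ranks)     # [1.0, 3.5, 2.0, 3.5]
print(descending_ranks)  # [4.0, 1.5, 3.0, 1.5]
print(min_ranks)         # [1.0, 3.0, 2.0, 3.0]
```

Either ranking direction gives the same coefficient, as long as both columns are ranked the same way.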

# 2. Calculate $d^2$

To calculate $d^2$, we take the difference between the two ranks of each store and square it.

In [None]:
# Rank difference per store, then its square
df["d"] = df["distance_rank"] - df["price_rank"]
df["d2"] = df["d"] ** 2
df

# 3. Calculate Spearman's Rank Correlation Coefficient

We compute $r_{s}$ by plugging the sum of the squared rank differences and $n$ into the formula:

In [None]:
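def spearman_rank(df):
    """Return Spearman's r_s from a DataFrame holding squared rank differences in column `d2`."""
    sum_d2 = df["d2"].sum()
    n = len(df)
    # Shortcut formula; exact only when there are no tied ranks
    return 1 - 6 * sum_d2 / (n * (n**2 - 1))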

r = spearman_rank(df)
print(f"Spearman's Rank Correlation Coefficient is: {r}")
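As a sanity check on the arithmetic, the same calculation can be done by hand. For this dataset the squared rank differences add up to $\sum d_{i}^{2} = 290$ and $n = 10$, so:

$$r_{s} = 1 - \frac{6 \times 290}{10\,(10^{2} - 1)} = 1 - \frac{1740}{990} \approx -0.758$$

The negative sign suggests that price tends to fall as the distance to the city center grows.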

### Double check with `scipy`

In [None]:
from scipy import stats

stats.spearmanr(df["distance"], df["price"])
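The result returned by `stats.spearmanr` can be unpacked like a tuple into the correlation and a p-value. The sketch below rebuilds the store data so that it runs on its own; the variable names `rho` and `p_value` are just illustrative choices.

```python
import pandas as pd
from scipy import stats

# Same store data as in the tutorial
df = pd.DataFrame({
    "distance": [50, 175, 250, 375, 425, 585, 720, 810, 875, 950],
    "price": [1.8, 1.25, 2, 1, 1.1, 1.2, 0.8, 0.6, 1.05, 0.85],
})

# The result unpacks like a tuple: (correlation, p-value)
rho, p_value = stats.spearmanr(df["distance"], df["price"])

print(f"rho = {rho:.4f}, p-value = {p_value:.4f}")
```

The correlation should match the value from our manual calculation. The p-value relates to a test of the null hypothesis that the two variables have no monotonic association.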

That's it! Hope you liked it!