**Table of contents**<a id='toc0_'></a>    
- [Relationships between variables](#toc1_)    
- [Covariance](#toc2_)    
  - [Measures of correlation](#toc2_1_)    
    - [Pearson correlation](#toc2_1_1_)    
    - [Spearman correlation](#toc2_1_2_)    
    - [Kendall Tau correlation](#toc2_1_3_)    
- [Correlation vs causation](#toc3_)    
- [Extra: Correlation among different types of variables](#toc4_)    
- [Extra: Covariance maths](#toc5_)    
- [Extra: StatQuest Videos](#toc6_)    


# <a id='toc1_'></a>[Relationships between variables](#toc0_)

![image.png](https://imgs.search.brave.com/uatPOeWfRfnVWY2MwwPaLkKXledacH5nbNl_JyHVcB0/rs:fit:860:0:0/g:ce/aHR0cHM6Ly93d3cu/c2ltcGx5cHN5Y2hv/bG9neS5vcmcvd3At/Y29udGVudC91cGxv/YWRzL2NvcnJlbGF0/aW9uLmpwZw)

In [None]:
import numpy as np
import pandas as pd
# Show floats with 2 decimals
pd.options.display.float_format = '{:,.2f}'.format

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

In [None]:
fortune = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/Fortune_1000.csv")
fortune.head()

# <a id='toc2_'></a>[Covariance](#toc0_)

> Covariance measures how two variables vary together. A positive covariance means that the variables tend to move in the same direction, a negative covariance means they tend to move in opposite directions, and a covariance of 0 means there is no linear relationship between them. Independent variables have 0 covariance, but 0 covariance does not by itself imply independence.

![image.png](../../../img/cov_less_0.png) ![image](../../../img/cov_around_0.png) ![image](../../../img/cov_more_0.png)  

Source: [Wikipedia](https://en.wikipedia.org/wiki/Covariance)
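
A zero covariance only rules out a linear relationship. The sketch below uses made-up values (not the Fortune 1000 data) in which `y` is completely determined by `x`, yet the covariance is still 0 because the relationship is symmetric rather than linear.

```python
import numpy as np

# Hypothetical toy data: y depends entirely on x, but not linearly
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = x ** 2

# Off-diagonal entry of the covariance matrix is cov(x, y)
cov_xy = np.cov(x, y)[0, 1]
print(cov_xy)
```

The positive and negative products of the deviations cancel out, so the covariance vanishes even though the variables are clearly dependent.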

In [None]:
# Initial covariance
np.cov(fortune.revenue, fortune.profit)

Why do you think there are NaNs in the covariance matrix?

In [None]:
# Drop rows where profit is missing, then recompute the covariance
fortune.dropna(subset=['profit'], inplace=True)
np.cov(fortune.revenue, fortune.profit)
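
The same effect can be reproduced on a tiny made-up table (the values are hypothetical, not from the Fortune 1000 file). In this sketch, `np.cov` returns NaN as soon as one value is missing, whereas pandas' `Series.cov` excludes missing values and gives the same result as dropping the incomplete rows first.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing profit value
toy = pd.DataFrame({
    "revenue": [10.0, 20.0, 30.0, 40.0],
    "profit": [1.0, np.nan, 3.0, 5.0],
})

# NumPy propagates the NaN into the result
cov_numpy = np.cov(toy.revenue, toy.profit)[0, 1]

# pandas skips pairs where either value is missing
cov_pandas = toy.revenue.cov(toy.profit)

# Dropping incomplete rows first makes NumPy agree with pandas
clean = toy.dropna(subset=["profit"])
cov_dropna = np.cov(clean.revenue, clean.profit)[0, 1]

print(cov_numpy, cov_pandas, cov_dropna)
```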

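The diagonal of the matrix holds the variance of each variable, and the off-diagonal entries hold the covariance between revenue and profit. We can extract the covariance by getting the second element in the 1st row: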

In [None]:
# Extract covariance
cov_matrix = np.cov(fortune.revenue, fortune.profit)
cov_matrix[0,1]

What does a positive covariance mean?

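⚠️ **Covariance doesn't measure how strongly the variables vary together!** ⚠️ Its magnitude depends on the units and scale of each variable, so it can't be compared between different pairs of variables. That's why we use correlation.

Example: With covariance, we can say that when revenue increases, profit tends to increase too, but not how strong that relationship is.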
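
To illustrate the scale problem, here is a minimal sketch with made-up revenue and profit figures (hypothetical values, not taken from the dataset). Expressing revenue in thousands instead of millions multiplies the covariance by 1000, while the correlation stays the same.

```python
import numpy as np

# Hypothetical figures in millions
revenue_millions = np.array([100.0, 250.0, 400.0, 800.0, 1200.0])
profit_millions = np.array([5.0, 20.0, 18.0, 60.0, 90.0])

# Same revenue, different unit
revenue_thousands = revenue_millions * 1000

cov_millions = np.cov(revenue_millions, profit_millions)[0, 1]
cov_thousands = np.cov(revenue_thousands, profit_millions)[0, 1]

corr_millions = np.corrcoef(revenue_millions, profit_millions)[0, 1]
corr_thousands = np.corrcoef(revenue_thousands, profit_millions)[0, 1]

print(cov_millions, cov_thousands)    # differ by a factor of 1000
print(corr_millions, corr_thousands)  # identical
```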

## <a id='toc2_1_'></a>[Measures of correlation](#toc0_)

> Measures the relationship and the dependency between two variables. “Covariance” indicates the direction of the linear relationship between variables. “Correlation” measures the direction and strength of the linear relationship between two variables. They can be calculated by:

* Pearson: Measures the strength of **linear** correlation between 2 numerical continuous variables
* Spearman: Measures the strength of **monotonic** (not necessarily linear) correlation between 2 numerical variables (discrete and/or continuous), based on the ranks of the values
* Kendall: Similar to Spearman (also rank-based), but often preferred when sample sizes are small; it is also less sensitive to outliers.

This [Stackexchange answer](https://datascience.stackexchange.com/a/64261) explains quite well the differences and assumptions of each correlation.

In [None]:
# Let's explore the relationship between profit and revenue like we did last time
plt.scatter(fortune.revenue, fortune.profit)
plt.xlabel('revenue')
plt.ylabel('profit')
plt.show()

### <a id='toc2_1_1_'></a>[Pearson correlation](#toc0_)

- Measures the strength of **linear** correlation between 2 numerical continuous variables

![pearson](https://imgs.search.brave.com/4_7sQIEbqLJM53We4YKYA13ibgVmnBKksdKFpRQvaLU/rs:fit:860:0:0/g:ce/aHR0cHM6Ly93d3cu/cXVlc3Rpb25wcm8u/Y29tL2Jsb2cvd3At/Y29udGVudC91cGxv/YWRzLzIwMjAvMDQv/UGVhcnNvbi1jb3Jy/ZWxhdGlvbi1jb2Vm/ZmljaWVudC0xLmpw/Zw)
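
The Pearson coefficient is the covariance divided by the product of the two standard deviations, which rescales it to the range -1 to 1. The sketch below checks this on made-up values (hypothetical, not from the dataset) against `np.corrcoef`.

```python
import numpy as np

# Hypothetical toy data with a nearly linear relationship
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Sample covariance and sample standard deviations (both use n - 1)
cov_xy = np.cov(x, y)[0, 1]
r_manual = cov_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))

# NumPy's built-in Pearson correlation
r_numpy = np.corrcoef(x, y)[0, 1]

print(r_manual, r_numpy)
```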

In [None]:
# Check correlation matrix - default is pearson correlation
fortune.corr(numeric_only=True)

In [None]:
# Add styling
fortune_corr = fortune.corr(numeric_only=True)
fortune_corr.style.background_gradient(cmap='RdYlGn').format('{:,.2f}')

In [None]:
# Create heatmap in seaborn
sns.heatmap(fortune_corr, annot=True)

Let's remove the duplicated values in our plot:

In [None]:
# Create mask for upper triangle
mask = np.zeros_like(fortune_corr)
mask[np.triu_indices_from(mask)] = True

# Plot heatmap
sns.heatmap(fortune_corr, mask=mask, annot=True, cmap='RdYlGn')

### <a id='toc2_1_2_'></a>[Spearman correlation](#toc0_)

- Measures the strength of **monotonic** (not necessarily linear) correlation between 2 numerical variables (discrete and/or continuous), computed on the ranks of the values

![spearman](https://imgs.search.brave.com/V76zc-JGOOuPAUO-S0KYm46FuQPrFY3cH4lX2kavgQw/rs:fit:860:0:0/g:ce/aHR0cHM6Ly93d3cu/c2ltcGxpbGVhcm4u/Y29tL2ljZTkvZnJl/ZV9yZXNvdXJjZXNf/YXJ0aWNsZV90aHVt/Yi9TcGVhcm1hbiVF/MiU4MCU5OXNfUmFu/a19Db3JyZWxhdGlv/bl8xLmpwZw)
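
Spearman correlation can be thought of as Pearson correlation applied to the ranks of the values. The sketch below uses a made-up exponential relationship (hypothetical data): it is perfectly monotonic but far from linear, so Pearson is well below 1 while Spearman is 1.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: y always increases with x, but not linearly
x = np.arange(1, 11)
df = pd.DataFrame({"x": x, "y": np.exp(x)})

# Pearson on the raw values (Series.corr defaults to Pearson)
pearson = df["x"].corr(df["y"])

# Spearman by hand: rank each column, then take Pearson on the ranks
spearman_manual = df["x"].rank().corr(df["y"].rank())

# Spearman through the DataFrame.corr API used in this notebook
spearman_pandas = df.corr(method="spearman").loc["x", "y"]

print(pearson, spearman_manual, spearman_pandas)
```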

In [None]:
# Check correlation matrix
round(fortune.corr(method='spearman', numeric_only=True), 2)

In [None]:
# Add styling
fortune_corr = fortune.corr(method='spearman', numeric_only=True)
fortune_corr.style.background_gradient(cmap='RdYlGn').format('{:,.2f}')

In [None]:
# Create heatmap in seaborn without upper triangle
mask = np.zeros_like(fortune_corr)
mask[np.triu_indices_from(mask)] = True

sns.heatmap(fortune_corr, mask=mask, annot=True, cmap='RdYlGn')



In [None]:
fortune['Market Cap'].dtype

In [None]:
fortune['Market Cap'] = pd.to_numeric(fortune['Market Cap'], errors='coerce')

In [None]:
fortune['Market Cap'].dtype
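
As a small illustration of what `errors='coerce'` does, the sketch below converts a made-up text column (hypothetical values, not the real `Market Cap` contents): entries that can't be parsed as numbers become NaN instead of raising an error.

```python
import pandas as pd

# Hypothetical text column, stored as object dtype like a messy CSV column
raw = pd.Series(["100", "250.5", "-", "N/A"], dtype=object)

# Unparseable entries are replaced with NaN
converted = pd.to_numeric(raw, errors="coerce")

print(converted)
print(converted.isna().sum())  # number of values that could not be parsed
```

Once the column is numeric, it is picked up by `corr(numeric_only=True)`.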

### <a id='toc2_1_3_'></a>[Kendall Tau correlation](#toc0_)

- Similar to Spearman (also rank-based), but often preferred when sample sizes are small; it is also less sensitive to outliers.
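
Kendall's tau is based on counting pairs of observations: a pair is concordant if both variables move in the same direction and discordant if they move in opposite directions. The sketch below computes the simplest variant (tau-a) by hand on made-up values with no ties; library implementations typically add a correction for ties, so this is an illustration rather than a replacement.

```python
from itertools import combinations

# Hypothetical toy data without ties
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]

concordant = 0
discordant = 0
for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
    direction = (x2 - x1) * (y2 - y1)
    if direction > 0:
        concordant += 1   # both variables move the same way
    elif direction < 0:
        discordant += 1   # variables move in opposite directions

n_pairs = len(x) * (len(x) - 1) // 2
tau = (concordant - discordant) / n_pairs

print(concordant, discordant, tau)
```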

In [None]:
# Check correlation matrix
fortune.corr(method='kendall', numeric_only=True)

In [None]:
# Add styling
fortune_corr = fortune.corr(method='kendall', numeric_only=True)
fortune_corr.style.background_gradient(cmap='RdYlGn').format('{:,.2f}')

In [None]:
# Create heatmap in seaborn without upper triangle
mask = np.zeros_like(fortune_corr)
mask[np.triu_indices_from(mask)] = True

sns.heatmap(fortune_corr, mask=mask, annot=True, cmap='RdYlGn')

# <a id='toc3_'></a>[Correlation vs causation](#toc0_)

One good way to understand that correlation is not causation is to look at completely unrelated events that nonetheless show some degree of association, such as the [number of master's degrees awarded in communication, journalism and related programs and the solar power generated in Malta](https://www.tylervigen.com/spurious/correlation/1874_masters-degrees-awarded-in-communication-journalism-and-related-programs_correlates-with_solar-power-generated-in-malta):  

![](../../../img/spurious-correlations.png)  

Finding a correlation between 2 variables is merely the first step in establishing a causal relationship. Other conditions must also be met:
 - time sequence, i.e. the alleged cause must occur before the effect
 - a plausible reasoning for the causal relationship, or as Wikipedia puts it: "a plausible physical or information-theoretical mechanism for an observed effect to follow from a possible cause"
 - the "causal" variable needs to be the only variable that could be responsible for the observed effect, i.e. there shouldn't be any other common or alternative variables that could explain the effect (e.g. [confounders](https://sphweb.bumc.bu.edu/otlt/mph-modules/bs/bs704-ep713_confounding-em/bs704-ep713_confounding-em_print.html#:~:text=Identifying%20Confounding,for%20a%20potential%20confounding%20factor.))

Counting the correlation itself as the 1st item, the 2nd and 3rd items can be reasonably easy to fulfill, but the 4th item takes considerably more thought (e.g. experimental design) and mathematical rigour. There is a whole field dedicated to causal relationships, called causal analysis (or causal inference), and whilst it is not yet widely applied in ML, it has started to gain more traction, e.g. [Causal ML library](https://causalml.readthedocs.io/en/latest/about.html) and [Causal ML book](https://causalml-book.org/).
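
To see how a confounder can create a correlation without any causal link, here is a small simulation under made-up assumptions: a hidden variable `z` drives both `x` and `y`, which never influence each other. The helper `residuals` is a hypothetical name introduced for this sketch; it removes the linear effect of `z` from a variable.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Hypothetical hidden common cause
z = rng.normal(size=n)

# x and y each depend on z plus their own independent noise
x = z + rng.normal(scale=0.5, size=n)
y = z + rng.normal(scale=0.5, size=n)

# x and y look strongly correlated
raw_corr = np.corrcoef(x, y)[0, 1]

def residuals(target, control):
    """Remove the linear effect of `control` from `target`."""
    slope, intercept = np.polyfit(control, target, 1)
    return target - (slope * control + intercept)

# After controlling for z, the association almost disappears
partial_corr = np.corrcoef(residuals(x, z), residuals(y, z))[0, 1]

print(raw_corr, partial_corr)
```

In real data the confounder is usually not observed, which is why ruling out alternative explanations is the hard part.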

# <a id='toc4_'></a>[Extra: Correlation among different types of variables](#toc0_)

We typically look at correlation between numerical values only as it's the simplest to quantify. However, if you are interested in learning about different types of correlations, e.g. categorical & categorical data, numerical & categorical data, you can read [this article](https://archive.ph/20210208110902/https://medium.com/@outside2SDs/an-overview-of-correlation-measures-between-categorical-and-continuous-variables-4c7f85610365).

# <a id='toc5_'></a>[Extra: Covariance maths](#toc0_)

This is the covariance in typical maths syntax:  
  
![image.png](https://miro.medium.com/v2/resize:fit:828/0*Vf0PmWaZUL4CtPC_)

Where:
- Σ = the Greek capital letter sigma. In maths, this means sum.
- d = In general maths, this is a differential (fancy for a very, very small interval). In this context, it means the difference between a value and the mean of its variable.
- n = sample size

It is called covariance because it looks at how 2 variables vary together instead of how 1 variable varies on its own. If you remember, the variance is the sum of squared differences between elements and their mean, divided by n - 1 (for a sample):

In [None]:
shoe_sizes = [46, 45, 44.5, 44, 43, 43, 43, 43, 42.5, 42, 41, 40, 39, 38, 37]
heights = [185, 184, 183, 182, 181, 181, 174, 174, 173, 170, 169, 169, 165, 160, 160]

shoe_size_mean = np.mean(shoe_sizes)
height_mean = np.mean(heights)
n = len(shoe_sizes)  # sample size

In [None]:
# Sample variance: sum of squared differences from the mean, divided by n - 1
variance = sum([(shoe_size - shoe_size_mean) ** 2 for shoe_size in shoe_sizes]) / (n - 1)
variance

We can also write the square as a product:

In [None]:
variance = sum([(shoe_size - shoe_size_mean) * (shoe_size - shoe_size_mean) for shoe_size in shoe_sizes]) / (n - 1)
variance

And we can see how the covariance is simply replacing the second difference with that of another variable:

In [None]:
covariance = sum([(shoe_size - shoe_size_mean) * (height - height_mean) for (shoe_size, height) in zip(shoe_sizes, heights)]) / (n - 1)
covariance

The covariance is positive, so when the height increases, the shoe size tends to increase too, and vice versa, which makes sense.
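
As a sanity check, the hand-written formulas can be compared with NumPy's built-ins. This sketch assumes the sample version of the formulas (dividing the sums by n - 1), which is what `np.cov` uses by default and what `np.var` uses with `ddof=1`.

```python
import numpy as np

shoe_sizes = [46, 45, 44.5, 44, 43, 43, 43, 43, 42.5, 42, 41, 40, 39, 38, 37]
heights = [185, 184, 183, 182, 181, 181, 174, 174, 173, 170, 169, 169, 165, 160, 160]

n = len(shoe_sizes)
shoe_size_mean = np.mean(shoe_sizes)
height_mean = np.mean(heights)

# Hand-written sample variance and covariance (sums divided by n - 1)
manual_variance = sum((s - shoe_size_mean) ** 2 for s in shoe_sizes) / (n - 1)
manual_covariance = sum(
    (s - shoe_size_mean) * (h - height_mean) for s, h in zip(shoe_sizes, heights)
) / (n - 1)

# NumPy equivalents
numpy_variance = np.var(shoe_sizes, ddof=1)
numpy_covariance = np.cov(shoe_sizes, heights)[0, 1]

print(manual_variance, numpy_variance)
print(manual_covariance, numpy_covariance)
```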

# <a id='toc6_'></a>[Extra: StatQuest Videos](#toc0_)

- [Covariance: Clearly Explained!](https://www.youtube.com/watch?v=qtaqvPAeEJY) - 22 min
- [Pearson Correlation: Clearly Explained!](https://www.youtube.com/watch?v=xZ_z8KWkhXE) - 19 min