# Variance, Covariance and Correlation

## Introduction 

In this lesson, we shall look at how the **Variance** of a random variable is used to calculate **Covariance** and **Correlation**, two key measures used in statistics for finding relationships between random variables. These measures help us identify the degree to which two sets of data tend to deviate from their expected values (i.e. means) in a similar way. Based on these measures, we can identify whether two variables are related to each other, and to what extent. This lesson will help you develop a conceptual understanding, work through the necessary calculations, and note some precautions to take when using these measures.

## Objectives

You will be able to:

* Understand and explain data variance and how it relates to standard deviation
* Understand and calculate Covariance and Correlation between two random variables
* Visualize and interpret the results of Covariance and Correlation

We will also work through an interactive exercise for each of the three measures.

For the interactive exercises, we will need pandas and numpy installed on our machine.
From your terminal, run **pip install pandas** and press the enter key.
Then run **pip install numpy** and press the enter key.
Alternatively, press shift enter on the cell below. If pandas or numpy is already installed, you will see a message saying so, which you can safely ignore.

In [None]:
! pip install pandas
! pip install numpy

In [3]:
# import both packages
import pandas as pd
import numpy as np

# Here we create a dataframe that describes quarterly gross domestic product (GDP) growth in percentages
# and a company's new product line growth in percentages
qtr_data = pd.DataFrame(data={'gdp_growth': [2, 3, 2.7, 3.2, 4.1], 'product_line_growth': [10, 14, 12, 15, 20]})
print('We will be using the following dataset throughout this notebook to understand Variance, Covariance and Correlation.')
print("This data describes quarterly gross domestic product (GDP) growth in percentages and a company's new product line growth in percentages.")

qtr_data

We will be using the following dataset throughout this notebook to understand Variance, Covariance and Correlation.
This data describes quarterly gross domestic product (GDP) growth in percentages and a company's new product line growth in percentages.


|   | gdp_growth | product_line_growth |
|---|------------|---------------------|
| 0 | 2.0 | 10 |
| 1 | 3.0 | 14 |
| 2 | 2.7 | 12 |
| 3 | 3.2 | 15 |
| 4 | 4.1 | 20 |


Throughout this notebook we will assume that this dataset is the population, not a sample.
Please keep this in mind: for the sake of simplicity, the interactive exercises rely on this assumption when calculating the three statistical measures.

## What is Variance ($\sigma^2$)

Before we talk about covariance, it is imperative that we get some idea of the **Variance** of a random variable. Variance refers to the __spread of a data set__.

> __Variance is a measure used to quantify how much a random variable deviates from its mean value__.

When we calculate variance, we are essentially asking, "__Given all the data points, how far from the mean do we expect a data point to be?__" The distance of a single data point from the mean is called its **deviation**, and variance is the average of the squared deviations.

For example, these two data sets have exactly the same mean (10), but are obviously quite different: [0, 8, 12, 20] and [8, 9, 11, 12].<br>
So what is different about these two sets? It is the spread of the data that is different. The Variance ($\sigma^2$) of a data set is a measure of how spread out the data is.

Variance is written using the notation $\sigma^2$. Previously, we have seen $\sigma$ as a measure of standard deviation within a given dataset. Remember that standard deviation is also a measure of the spread of data. __Variance is simply the square of standard deviation (or, equivalently, standard deviation is the square root of variance)__.
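As a quick check of the two example data sets mentioned above, here is a minimal sketch using numpy. The variable names `wide` and `narrow` are only illustrative labels, and both sets are treated as populations (division by n).

```python
import numpy as np

# The two example data sets from the text: same mean, different spread
wide = np.array([0, 8, 12, 20])
narrow = np.array([8, 9, 11, 12])

for name, data in [("wide", wide), ("narrow", narrow)]:
    variance = np.var(data)  # population variance (divides by n)
    std_dev = np.std(data)   # population standard deviation
    print('{}: mean={}, variance={}, std={}'.format(name, data.mean(), variance, std_dev))

# Squaring the standard deviation recovers the variance (up to floating point error)
print(np.isclose(np.std(wide) ** 2, np.var(wide)))
```

Both sets have a mean of 10, but the first has a variance of 52 while the second has a variance of 2.5, which matches the intuition that the first set is far more spread out.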

### How to Calculate Variance? 

Variance is calculated by:
1. Taking the difference between each element in a data set and the mean,
2. Squaring those differences so that every term is non-negative,
3. Dividing the sum of the resulting squares by the number of values in the set.

$$\sigma^2 = \frac{\sum(x-\mu)^2}{n}$$

Here, 

$x$ represents an individual data point.

$\mu$ represents the mean of the data points.

$n$ is the total number of data points. 

Remember that when calculating a **sample variance** in order to estimate a population variance, the denominator of the variance equation becomes **n - 1**. Dividing by n would tend to underestimate the population variance, because the deviations are measured from the sample mean rather than from the true population mean; dividing by n - 1 corrects this bias.
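The difference between the two denominators can be sketched with numpy's `ddof` ("delta degrees of freedom") argument, where the divisor is `n - ddof`. The snippet below reuses the `gdp_growth` values from this notebook's dataset; treating them as a sample here is purely for illustration.

```python
import numpy as np

gdp_growth = np.array([2, 3, 2.7, 3.2, 4.1])
n = len(gdp_growth)

# ddof=0 divides by n (population variance); this is numpy's default for var()
population_var = np.var(gdp_growth, ddof=0)

# ddof=1 divides by n - 1 (sample variance)
sample_var = np.var(gdp_growth, ddof=1)

print('Population variance (divide by n): {}'.format(population_var))
print('Sample variance (divide by n - 1): {}'.format(sample_var))

# The two differ only by the factor n / (n - 1)
print(np.isclose(sample_var, population_var * n / (n - 1)))
```

The sample variance (about 0.585) is larger than the population variance (about 0.468). The gap shrinks as n grows, since n / (n - 1) approaches 1.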

The following illustration summarizes how the spread of data around the mean (10) relates to the variance. Wider curves correspond to increasing variance.

<img src="images/var2.png" width=500>

### Interpreting Variance 

Variance cannot be negative. It can only be equal to or greater than 0. A variance of zero indicates that all of the values within a data set are identical. The larger the variance, the more spread out the data set. A large variance means that the numbers in a set are far from the mean and from each other. A small variance means that the numbers are closer together in value.

### Example Use Case

Consider the following graphs for Conglomo, Inc. and Bilco, Inc. These graphs show the theoretical frequency distributions of the monthly returns for each firm's common stock as though the returns were normally distributed.

<img src="images/var.png" width=400>

Conglomo's distribution of returns is more concentrated than Bilco's, as illustrated by Conglomo's relatively narrower bell curve. A more concentrated distribution is defined as having a smaller standard deviation. The distribution curve appears **higher, steeper, and narrower** because more observations are occurring **close to the expected return**. Bilco's distribution is rather **flat**, reflecting that its returns are less concentrated, or more dispersed, than those of Conglomo Inc.
Most people are risk averse, in that they wish to minimize the amount of risk they must endure to earn a certain level of expected return. __For the same expected return, a risk-averse investor would prefer Conglomo's stock because its distribution of returns is more concentrated around the expected value.__

### Interactive Exercise to understand Variance

Let's now calculate the variance of both variables in our dataset using the popular numerical Python package known as **numpy**.<br>
Let us look at the dataset again.

In [14]:
qtr_data

|   | gdp_growth | product_line_growth |
|---|------------|---------------------|
| 0 | 2.0 | 10 |
| 1 | 3.0 | 14 |
| 2 | 2.7 | 12 |
| 3 | 3.2 | 15 |
| 4 | 4.1 | 20 |


**Step 1**<br>
The first step is to calculate the mean of the variables. This can be done easily using the **mean()** function in numpy.

In [7]:
# Calculate the mean of gdp_growth and product_line_growth
gdp_growth_mean = np.mean(qtr_data['gdp_growth'])
product_line_growth_mean = np.mean(qtr_data['product_line_growth'])

print('Mean value of gdp_growth is: {}'.format(gdp_growth_mean))
print('Mean value of product_line_growth is: {}'.format(product_line_growth_mean))

Mean value of gdp_growth is: 3.0
Mean value of product_line_growth is: 14.2


**Step 2**<br>
The next step is to calculate the difference of all values from their respective means and then square those values.

We can find the difference using broadcasting in numpy and then square the result.

In [13]:
# Calculate the squared difference
gdp_growth_sqrd_diff = (qtr_data['gdp_growth'] - gdp_growth_mean)**2
product_line_growth_sqrd_diff = (qtr_data['product_line_growth'] - product_line_growth_mean)**2

print('Squared differences for gdp_growth: {}'.format(gdp_growth_sqrd_diff.values))
print('Squared differences for product_line_growth: {}'.format(product_line_growth_sqrd_diff.values))

Squared differences for gdp_growth: [1.   0.   0.09 0.04 1.21]
Squared differences for product_line_growth: [17.64  0.04  4.84  0.64 33.64]


**Step 3**<br>
The final step is to calculate the average of the squared differences. Remember that we assumed our dataset to be the population and not a sample. Hence we take the average (i.e. divide the sum by __n__ and not n-1).

Again, numpy's **mean()** method comes in handy here.

In [15]:
# Calculate the average of the squared differences. This is the variance.
gdp_growth_var = np.mean(gdp_growth_sqrd_diff)
product_line_growth_var = np.mean(product_line_growth_sqrd_diff)

print('Variance in gdp_growth is: {}'.format(gdp_growth_var))
print('Variance in product_line_growth is: {}'.format(product_line_growth_var))

Variance in gdp_growth is: 0.4679999999999998
Variance in product_line_growth is: 11.36


One thing that we can observe is that variance in the variable product_line_growth is more than the variance in the variable gdp_growth. This implies that data in product_line_growth is more spread out around its mean than the data in gdp_growth.

We can see that this is true by looking at the values in the dataset.

**Note**<br>
Now that we know how to calculate variance by hand, note that numpy also provides a function that calculates the variance of a variable directly, so that we don't have to go through the entire process each time.

The function is called **var()** and is used as follows. Note that it produces exactly the same results as our step-by-step calculation.

In [19]:
# Calculate variance using the numpy.var() method
gdp_growth_var = np.var(qtr_data['gdp_growth'])
product_line_growth_var = np.var(qtr_data['product_line_growth'])

print('Variance in gdp_growth is: {}'.format(gdp_growth_var))
print('Variance in product_line_growth is: {}'.format(product_line_growth_var))

Variance in gdp_growth is: 0.4679999999999998
Variance in product_line_growth is: 11.36


## Covariance ($\sigma_{xy}$)

Now that we know what variance is and what quantity it measures, imagine extending the idea to two random variables to get some idea of how they change together (or stay the same) across all their values.

In statistics, if we are trying to figure out how two random variables tend to **vary** together, we are effectively talking about the **Covariance** between these variables. Covariance provides an insight into how two variables are __related__ to one another.

More precisely, covariance refers to:
> The measure of how two random variables in a data set will __vary together__. 
  
### How to calculate Covariance?
In essence, covariance is used to measure **how much variables vary together**, and it's calculated using the formula:


$$ \large \sigma_{XY} = \frac{\sum_{i=1}^{n}(x_i -\mu_x)(y_i - \mu_y)}{n}$$

Here $X$ and $Y$ are two random variables having $n$ elements each. We want to calculate ___how $X$ and $Y$ vary together___, by measuring how values in $Y$ change with observed changes in $X$ values.

> By convention we call $X$ our __independent variable__ and $Y$ the __dependent variable__, although covariance itself is symmetric: swapping $X$ and $Y$ gives the same value.

$x_i$ = ith element of variable $X$

$y_i$ = ith element of variable $Y$

$n$ = number of data points (__$n$ must be same for $X$ and $Y$__)

$\mu_x$ = mean of the independent variable $X$

$\mu_y$ = mean of the dependent variable $Y$

$\sigma_{XY}$ = Covariance between $X$ and $Y$

*We can see that the above formula has the same form as the variance formula: instead of squaring each deviation from the mean, it multiplies the deviation of each $x_i$ by the deviation of the corresponding $y_i$. Hence the term __Co-Variance__.* Again, in the case of a **sample** covariance, we replace the denominator with **n-1**.
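The population covariance formula can be sketched as a small function. The name `population_covariance` is a hypothetical helper written for this illustration, not a numpy function. The sketch also checks two properties that follow from the formula: covariance is symmetric in its arguments, and the covariance of a variable with itself is its variance.

```python
import numpy as np

def population_covariance(x, y):
    """Average product of deviations from the mean (divides by n)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.shape != y.shape:
        raise ValueError('x and y must have the same number of data points')
    return np.mean((x - x.mean()) * (y - y.mean()))

gdp_growth = [2, 3, 2.7, 3.2, 4.1]
product_line_growth = [10, 14, 12, 15, 20]

cov_xy = population_covariance(gdp_growth, product_line_growth)
cov_yx = population_covariance(product_line_growth, gdp_growth)
cov_xx = population_covariance(gdp_growth, gdp_growth)

print('cov(X, Y) = {}'.format(cov_xy))
print('cov(Y, X) = {}'.format(cov_yx))  # same value: covariance is symmetric
print('cov(X, X) = {}, var(X) = {}'.format(cov_xx, np.var(gdp_growth)))
```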

### Interpreting Covariance values 

* A positive covariance indicates that **higher than average** values of one variable tend to pair with higher than average values of the other variable. This essentially means that as one variable increases, the other tends to increase as well.

* Negative covariance indicates that higher than average values of one variable tend to pair with **lower than average** values of the other variable. This essentially means that as one variable increases, the other tends to decrease.

* A zero value, or a value close to zero, indicates no linear relationship, i.e. knowing whether one variable is above or below its mean tells us little about where the other sits relative to its own mean.

This behavior can be further explained using the scatter plots below.
<img src="images/covariance.gif" width=500>

A large negative covariance value shows an inverse relationship between the values on the x and y axes, i.e. y decreases as x increases. This is shown by the scatter plot on the left. The middle scatter plot shows values spread all over the plot, reflecting the fact that the variables on the x and y axes cannot be related in terms of how they vary together. The covariance value for such variables would be very close to zero.

In the scatter plot on the right, we see a strong positive relationship between the values on the x and y axes, i.e. y increases as x increases.

>__Covariance is not standardized. Therefore, covariance values can range from negative infinity to positive infinity.__

### Interactive Exercise to understand Co-Variance

Let us now calculate the covariance of the two variables in our dataset.<br>
Here is the dataset again.

In [20]:
qtr_data

|   | gdp_growth | product_line_growth |
|---|------------|---------------------|
| 0 | 2.0 | 10 |
| 1 | 3.0 | 14 |
| 2 | 2.7 | 12 |
| 3 | 3.2 | 15 |
| 4 | 4.1 | 20 |


**Step 1.**<br>
The first step is to calculate the mean of the variables. This can be done easily using the **mean()** function in numpy.

In [21]:
# Calculate the mean of both the variables
gdp_growth_mean = np.mean(qtr_data['gdp_growth'])
product_line_growth_mean = np.mean(qtr_data['product_line_growth'])

print('Mean value of gdp_growth is: {}'.format(gdp_growth_mean))
print('Mean value of product_line_growth is: {}'.format(product_line_growth_mean))

Mean value of gdp_growth is: 3.0
Mean value of product_line_growth is: 14.2


**Step 2.**<br>
The next step is to calculate the difference of all values from their respective means and then multiply the respective differences of both variables with each other. Note that unlike variance, here instead of squaring the differences (i.e. multiplying the difference values of a variable with itself), we are multiplying the difference values of both variables with each other.

We can find the difference using broadcasting in numpy and then multiply the respective values using broadcasting again.

In [26]:
# Calculate the difference of all values from their respective means 
gdp_growth_diff = qtr_data['gdp_growth'] - gdp_growth_mean
product_line_growth_diff = qtr_data['product_line_growth'] - product_line_growth_mean

print('Difference from mean for gdp_growth: {}'.format(gdp_growth_diff.values))
print('Difference from mean for product_line_growth: {}'.format(product_line_growth_diff.values))

# Multiply the respective differences of both variables with each other
product = gdp_growth_diff*product_line_growth_diff

print('Product of differences of both variables: {}'.format(product.values))

Difference from mean for gdp_growth: [-1.   0.  -0.3  0.2  1.1]
Difference from mean for product_line_growth: [-4.2 -0.2 -2.2  0.8  5.8]
Product of differences of both variables: [ 4.2  -0.    0.66  0.16  6.38]


**Step 3.**<br>
The final step is to calculate the average of the products. Again, remember that we assumed our dataset to be the population and not a sample. Hence we take the average (i.e. divide the sum by **n** and not n-1).

Again, numpy's **mean()** method comes in handy here.

In [27]:
# Calculate the average value of the products that we calculated in the previous step. This is the covariance.
covariance = np.mean(product)

print('Covariance between gdp_growth and product_line_growth: {}'.format(covariance))

Covariance between gdp_growth and product_line_growth: 2.28


We see that there is a positive covariance between the two variables. So, we can say that growth of the company's new product line has a positive relationship with quarterly GDP growth. 

**Note**<br>
Now that we know how to calculate covariance by hand, note that numpy also provides a function that calculates the covariance between two variables directly, so that we don't have to go through the entire process each time.<br>
The function is called **cov()** and is used as follows. It returns a covariance matrix whose diagonal elements represent the **variance** of each variable. The off-diagonal elements represent the **covariance** between the two variables.

Note that it produces exactly the same results as our step-by-step calculation. The diagonal elements are the variances of gdp_growth and product_line_growth.

In [35]:
# Calculate covariance using the numpy.cov() method. 
# Remember we mentioned that we will be assuming that this dataset is not a sample but the population.
# That is the reason for setting ddof equal to 0. If the dataset were a sample, ddof would be equal to 1.
covariance = np.cov(qtr_data['gdp_growth'], qtr_data['product_line_growth'], ddof=0)

print('Covariance matrix:\n {}'.format(covariance))

Covariance matrix:
 [[ 0.468  2.28 ]
 [ 2.28  11.36 ]]
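One precaution worth illustrating: `np.var()` divides by n by default, whereas `np.cov()` divides by n - 1 by default, so omitting `ddof=0` silently switches to the sample covariance. The following is a minimal sketch of the difference on this notebook's data; the array names are chosen here for illustration.

```python
import numpy as np

gdp_growth = np.array([2, 3, 2.7, 3.2, 4.1])
product_line_growth = np.array([10, 14, 12, 15, 20])

# Explicit ddof=0: population covariance (divide by n)
population_cov = np.cov(gdp_growth, product_line_growth, ddof=0)[0, 1]

# Default behaviour of np.cov: sample covariance (divide by n - 1)
sample_cov = np.cov(gdp_growth, product_line_growth)[0, 1]

print('Population covariance: {}'.format(population_cov))
print('Sample covariance (np.cov default): {}'.format(sample_cov))
```

Here the population covariance is about 2.28 and the sample covariance about 2.85, so it pays to state `ddof` explicitly whenever the distinction matters.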


## Correlation 

Above, we saw how covariance can identify the degree to which two random variables tend to vary together, while using a formulation that depends on the units of the variables $X$ and $Y$. So, if different experiments contain underlying data measured in different units, the covariance measure will not produce comparable results. For this, we need to normalize this degree of variation into a standard unit, with interpretable results independent of the units of the data. We achieve this with a derived normalized measure, called correlation.

Correlation is defined as covariance, normalized by the product of the standard deviations of $X$ and $Y$. This normalization sets the scale from -1 to 1. So the correlation between $X$ and $Y$ would be calculated as:

$$Correlation(X,Y) = \frac{\sigma_{XY}}{\sigma_X\sigma_Y}$$

>When two random variables **correlate**, a change in the values of one tends to be accompanied by a change in the values of the other.

In data science practice, we typically tend to look at correlation rather than covariance because it is more interpretable, since it does not depend on the scale of either random variable involved.
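To make the point about scale concrete, here is a small sketch that expresses `gdp_growth` once in percentages and once as fractions (an arbitrary change of units chosen for illustration). The covariance changes with the units, while the correlation does not.

```python
import numpy as np

gdp_pct = np.array([2, 3, 2.7, 3.2, 4.1])       # growth in percent
product_pct = np.array([10, 14, 12, 15, 20])    # growth in percent
gdp_frac = gdp_pct / 100                        # same data as fractions

# Covariance depends on the units of the inputs
cov_pct = np.cov(gdp_pct, product_pct, ddof=0)[0, 1]
cov_frac = np.cov(gdp_frac, product_pct, ddof=0)[0, 1]

# Correlation is unchanged by rescaling either variable by a positive constant
corr_pct = np.corrcoef(gdp_pct, product_pct)[0, 1]
corr_frac = np.corrcoef(gdp_frac, product_pct)[0, 1]

print('Covariance  (percent):  {}'.format(cov_pct))
print('Covariance  (fraction): {}'.format(cov_frac))
print('Correlation (percent):  {}'.format(corr_pct))
print('Correlation (fraction): {}'.format(corr_frac))
```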

### Types of Correlation Measures

The __coefficient of correlation__, r, also called the linear correlation coefficient, measures the strength and the direction of a linear relationship between two variables. It is also known as the __Pearson correlation coefficient__.

In statistics, four types of correlation are commonly used for detailed relationship analysis:
* Pearson correlation
* Kendall Rank correlation
* Spearman correlation
* Point-Biserial correlation


For now, we shall focus on Pearson correlation as it is the go-to correlation measure for most needs. 

__Pearson r__ correlation is the most widely used correlation statistic to measure the degree of the relationship between two linearly related variables. For the Pearson r correlation, both variables should be normally distributed (normally distributed variables have a bell-shaped curve). Other assumptions include linearity and homoscedasticity. Linearity assumes a straight-line relationship between the two variables, and homoscedasticity assumes that the data is equally spread about the regression line.
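To give a feel for how a rank-based measure differs from Pearson, the sketch below computes a Spearman-style coefficient as the Pearson correlation of the ranks of the data, which is the usual definition of Spearman correlation. It relies on pandas' `Series.rank()`; the variable names are chosen here for illustration.

```python
import numpy as np
import pandas as pd

gdp_growth = pd.Series([2, 3, 2.7, 3.2, 4.1])
product_line_growth = pd.Series([10, 14, 12, 15, 20])

# Pearson: linear relationship between the raw values
pearson_r = np.corrcoef(gdp_growth.to_numpy(), product_line_growth.to_numpy())[0, 1]

# Spearman: Pearson correlation applied to the ranks of the values
gdp_ranks = gdp_growth.rank().to_numpy()
product_ranks = product_line_growth.rank().to_numpy()
spearman_rho = np.corrcoef(gdp_ranks, product_ranks)[0, 1]

print('Pearson r:    {}'.format(pearson_r))
print('Spearman rho: {}'.format(spearman_rho))
```

On this dataset both variables have exactly the same rank ordering, so the rank-based coefficient is 1 even though the Pearson coefficient is slightly below 1: the relationship is perfectly monotonic but not perfectly linear.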


### Calculating Coefficient of Correlation (r)

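Pearson Correlation (r) is calculated using the following formula:

$$ r = \frac{\sum_{i=1}^{n}(x_i -\mu_x)(y_i - \mu_y)} {\sqrt{\sum_{i=1}^{n}(x_i - \mu_x)^2 \sum_{i=1}^{n}(y_i-\mu_y)^2}}$$

This is the covariance $\sigma_{XY}$ divided by the product of the standard deviations $\sigma_X\sigma_Y$; the factors of $n$ in the numerator and denominator cancel. Just like in the case of covariance, $X$ and $Y$ are two random variables having $n$ elements each.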


$x_i$ = ith element of variable $X$

$y_i$ = ith element of variable $Y$

$n$ = number of data points (__$n$ must be same for $X$ and $Y$__)

$\mu_x$ = mean of the independent variable $X$

$\mu_y$ = mean of the dependent variable $Y$

$r$ = Calculated Pearson Correlation


A detailed mathematical insight into this equation is available [in this paper](http://www.hep.ph.ic.ac.uk/~hallg/UG_2015/Pearsons.pdf).
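As a minimal sketch, Pearson's r can be computed directly from sums of deviations: the sum of the products of deviations, divided by the square root of the product of the two sums of squared deviations. The function name `pearson_r` is a hypothetical helper for this illustration, and the result is compared against numpy's `corrcoef()`.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation from sums of deviations (the factors of n cancel)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations of x from its mean
    dy = y - y.mean()  # deviations of y from its mean
    return np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2))

gdp_growth = [2, 3, 2.7, 3.2, 4.1]
product_line_growth = [10, 14, 12, 15, 20]

r_manual = pearson_r(gdp_growth, product_line_growth)
r_numpy = np.corrcoef(gdp_growth, product_line_growth)[0, 1]

print('r from the formula:  {}'.format(r_manual))
print('r from np.corrcoef: {}'.format(r_numpy))
```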

### Interpreting Correlation values

> __The correlation formula shown above always gives values in the range between -1 and 1__

If two variables have a correlation of +0.9, this means that a change in one variable is accompanied by a closely matching change in the same direction in the other variable. A correlation value of -0.9 means that a change in one variable is accompanied by a change in the opposite direction in the other variable. A Pearson correlation near 0 indicates no linear relationship. Here are some examples of Pearson correlation values shown as scatter plots.
<img src = 'images/pearson_2.png'>

Think about stock markets in terms of correlation. All the stock market indexes tend to move together in similar directions. When the Dow Jones loses 5%, the S&P 500 usually loses around 5%. When the Dow Jones gains 5%, the S&P 500 usually gains around 5% because they are **highly correlated**.

On the other hand, there could also be negative correlation, where you might observe that as the Dow Jones loses 5% of its value, Gold gains 5%. Alternatively, if the Dow Jones gains 5% of its value, Gold may lose 5% of its value. That's **negative correlation**.

**Note**<br>
Correlation does not imply Causation. Just because the values of two variables change together does not mean that one variable **causes** change in the other variable. Consider the following example:<br>
__Ice cream sales are correlated with homicides in New York (Study)__<br>
As the sales of ice cream rise and fall, so do the number of homicides. Does this mean consumption of ice cream **causes** the death of people?<br>
No. Just because two things are correlated doesn't mean one causes the other.
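The ice cream example can be imitated with a small simulation. Everything below is synthetic: the numbers are made up for illustration and are not taken from any study. The assumption is that a third variable, temperature, drives both quantities. The sketch shows a strong raw correlation that largely disappears once the linear effect of temperature is removed from each variable.

```python
import numpy as np

# Synthetic data only: temperature is assumed to drive both variables
rng = np.random.default_rng(seed=0)
temperature = rng.uniform(0, 30, size=500)
ice_cream_sales = 10 * temperature + rng.normal(0, 20, size=500)
incidents = 0.3 * temperature + rng.normal(0, 1, size=500)

# Raw correlation between the two variables
raw_r = np.corrcoef(ice_cream_sales, incidents)[0, 1]

def residuals(y, x):
    """Remove the best straight-line fit of y on x and return what is left."""
    slope, intercept = np.polyfit(x, y, deg=1)
    return y - (slope * x + intercept)

# Correlation after removing the linear effect of temperature from both
adjusted_r = np.corrcoef(residuals(ice_cream_sales, temperature),
                         residuals(incidents, temperature))[0, 1]

print('Raw correlation: {}'.format(raw_r))
print('Correlation after accounting for temperature: {}'.format(adjusted_r))
```

Neither simulated variable influences the other, yet they are strongly correlated, because both depend on temperature.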


### Use Cases


#### Social Media and Websites
Digital publishers want to maximize their understanding of the potential relationship between social media activity and visits to their website. For example, the digital publisher runs the correlation report between hourly Twitter mentions and visits for a two week period. The correlation is found to be r = 0.28, which indicates a medium, positive relationship between Twitter mentions and website visits.

#### Optimization for E-retailers
E-retailers are interested in driving increased revenue. For example, an e-retailer wants to compare a number of secondary success events (e.g., file downloads, product detail page views, internal search click-throughs, etc.) with weekly web revenue. They can quickly identify internal search click-throughs as having the highest correlation, which may indicate an area for optimization.

### Interactive Exercise for Correlation

Let us now calculate the coefficient of correlation of the two variables in our dataset.<br>
Here is the dataset again:

In [36]:
qtr_data

|   | gdp_growth | product_line_growth |
|---|------------|---------------------|
| 0 | 2.0 | 10 |
| 1 | 3.0 | 14 |
| 2 | 2.7 | 12 |
| 3 | 3.2 | 15 |
| 4 | 4.1 | 20 |


**Step 1.**<br>
The first step is to calculate the mean of the variables. This can be done easily using the **mean()** function in numpy.

In [37]:
# Calculate the mean of the variables
gdp_growth_mean = np.mean(qtr_data['gdp_growth'])
product_line_growth_mean = np.mean(qtr_data['product_line_growth'])

print('Mean value of gdp_growth is: {}'.format(gdp_growth_mean))
print('Mean value of product_line_growth is: {}'.format(product_line_growth_mean))

Mean value of gdp_growth is: 3.0
Mean value of product_line_growth is: 14.2


**Step 2.**<br> 
The next step is to calculate the difference of all values from their respective means and then multiply the respective differences of both variables with each other. We then calculate the average value of the products to find the covariance between the two variables, just like we did in the interactive exercise for covariance.

We can find the difference using broadcasting in numpy and then multiply the respective values using broadcasting again.

In [60]:
# Calculate the difference of all values from their respective means
gdp_growth_diff = qtr_data['gdp_growth'] - gdp_growth_mean
product_line_growth_diff = qtr_data['product_line_growth'] - product_line_growth_mean

print('Difference from mean for gdp_growth: {}'.format(gdp_growth_diff.values))
print('Difference from mean for product_line_growth: {}'.format(product_line_growth_diff.values))
print('\n')

# Multiply the respective differences of both variables with each other
product = gdp_growth_diff * product_line_growth_diff
print('Product of differences of both variables: {}'.format(product.values))
print('\n')

# Calculate the average value of the products to find the covariance between the two variables
covariance = np.mean(product)
print('Covariance between the two variables is: {}'.format(covariance))

Difference from mean for gdp_growth: [-1.   0.  -0.3  0.2  1.1]
Difference from mean for product_line_growth: [-4.2 -0.2 -2.2  0.8  5.8]


Product of differences of both variables: [ 4.2  -0.    0.66  0.16  6.38]


Covariance between the two variables is: 2.28


**Step 3.**<br> 
We will also calculate the standard deviations of both variables. Remember that standard deviation is just the square root of the variance, so we can get each standard deviation by taking the square root of the corresponding variance.

In [57]:
# Calculate the standard deviations of the variables by taking the square root of their variances
gdp_growth_std = np.var(qtr_data['gdp_growth'])**(1/2)
product_line_growth_std = np.var(qtr_data['product_line_growth'])**(1/2)

print('Standard Deviation of gdp_growth: {}'.format(gdp_growth_std))
print('Standard Deviation of product_line_growth: {}'.format(product_line_growth_std))

Standard Deviation of gdp_growth: 0.6841052550594826
Standard Deviation of product_line_growth: 3.370459909270543


**Step 4.**<br>
The final step is to normalize the covariance by the product of the standard deviations of the two variables.

In [61]:
# Normalize the covariance by the product of the standard deviations of the two variables 
# to get the correlation coefficient
r = covariance / (gdp_growth_std*product_line_growth_std)

print('Correlation coefficient of gdp_growth and product_line_growth is: {}'.format(r))

Correlation coefficient of gdp_growth and product_line_growth is: 0.9888325519611423


We can see that the correlation coefficient is approximately 0.989, which indicates a very strong positive linear relationship between the company's new product line growth and the quarterly GDP growth. Because the coefficient is constrained to lie between -1 and 1, its magnitude tells us the strength of the relationship and its sign tells us the direction.

**Note**<br>
Now that we know how to calculate the correlation coefficient by hand, note that numpy also provides a function that calculates the correlation coefficient between two variables directly, so that we don't have to go through the entire process ourselves. The function is called **corrcoef()** and is used as follows.

It returns a correlation matrix whose diagonal elements represent the correlation of a variable with itself. The correlation of a variable with itself will always be 1. The off-diagonal elements represent the correlation between the two variables.

Note that it produces exactly the same result as our step-by-step calculation.

In [62]:
np.corrcoef(qtr_data['gdp_growth'], qtr_data['product_line_growth'])

array([[1.        , 0.98883255],
       [0.98883255, 1.        ]])

### So how do these measures relate to each other?

Are Covariance and Correlation the same thing? Simply put, no.

While both covariance and correlation indicate whether variables are positively or inversely related to each other, they are not the same. Correlation also tells us the degree to which the variables tend to move together.

Covariance is expressed in the product of the units of the two variables. Using covariance, analysts can determine whether the variables tend to move in the same direction or in opposite directions, but they cannot judge how strongly the variables move together, because covariance is not expressed on a standardized scale.

Correlation, on the other hand, standardizes the measure of interdependence between two variables and informs researchers as to how closely the two variables move together.
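The relationship between the two measures can be sketched by converting a covariance matrix into a correlation matrix: each entry is divided by the product of the corresponding standard deviations, which are the square roots of the diagonal entries. This is a minimal illustration on this notebook's data; the variable names are chosen here.

```python
import numpy as np

gdp_growth = np.array([2, 3, 2.7, 3.2, 4.1])
product_line_growth = np.array([10, 14, 12, 15, 20])

# Population covariance matrix: variances on the diagonal, covariance off it
cov_matrix = np.cov(gdp_growth, product_line_growth, ddof=0)

# Standard deviations are the square roots of the diagonal (the variances)
std = np.sqrt(np.diag(cov_matrix))

# Divide every entry by the product of the matching standard deviations
corr_matrix = cov_matrix / np.outer(std, std)

print('Covariance matrix:\n {}'.format(cov_matrix))
print('Correlation matrix:\n {}'.format(corr_matrix))
```

The result matches `np.corrcoef()` for the same data, with ones on the diagonal and the Pearson coefficient off the diagonal.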

## Summary
In this lesson, we looked at the variance of a random variable as a measure of its spread around the mean. We saw how the same idea can be extended to calculate covariance, followed by correlation, to analyze how change in one variable is associated with change in another variable. Next, we shall see how we can use correlation analysis to run a __regression analysis__ and later, how covariance calculation helps us with dimensionality reduction.