### Correlation

Let's start with relationships between numeric variables. We can visualize these relationships with scatterplots. In this scatterplot, we can see the relationship between the total amount of sleep mammals get and the amount of REM sleep they get. The variable on the x-axis is called the explanatory or independent variable, and the variable on the y-axis is called the response or dependent variable.

We can also examine relationships between two numeric variables using a number called the correlation coefficient. This is a number between -1 and 1, where the magnitude corresponds to the strength of the relationship between the variables, and the sign, positive or negative, corresponds to the direction of the relationship. 

### Magnitude = strength of relationship

Here's a scatterplot of two variables, x and y, that have a correlation coefficient of 0.99. Since the data points are closely clustered around a line, we can describe this as a near-perfect or very strong relationship. If we know what x is, we'll have a pretty good idea of what the value of y could be. Here, x and y have a correlation coefficient of 0.75, and the data points are a bit more spread out. In this plot, x and y have a correlation of 0.56 and are therefore moderately correlated. A correlation coefficient around 0.2 would be considered a weak relationship.

The sign of the correlation coefficient corresponds to the direction of the relationship. A positive correlation coefficient indicates that as x increases, y also increases. A negative correlation coefficient indicates that as x increases, y decreases.
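
To make the link between scatter, strength, and sign concrete, here is a minimal sketch on synthetic data. Everything in it is an assumption for illustration (the seed, the noise levels, and the helper name `corr_with_noise` are made up and are not the data behind the plots described above). It uses `np.corrcoef` to compute the correlation coefficient.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed so the sketch is repeatable
x = rng.normal(size=500)

def corr_with_noise(slope, noise_sd):
    """Correlation between x and a noisy linear function of x (hypothetical helper)."""
    y = slope * x + rng.normal(scale=noise_sd, size=x.size)
    return np.corrcoef(x, y)[0, 1]

r_strong = corr_with_noise(slope=1.0, noise_sd=0.2)     # points hug a rising line
r_weak = corr_with_noise(slope=1.0, noise_sd=5.0)       # same trend, buried in noise
r_negative = corr_with_noise(slope=-1.0, noise_sd=0.2)  # points hug a falling line

print(f"little noise, rising line:  r = {r_strong:.2f}")
print(f"lots of noise, rising line: r = {r_weak:.2f}")
print(f"little noise, falling line: r = {r_negative:.2f}")
```

The more the points scatter around the line, the closer the coefficient gets to 0; flipping the slope flips the sign but not the magnitude.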

To visualize relationships between two variables, we can use a scatterplot. We'll use seaborn, which is a plotting package built on top of matplotlib. We import seaborn as `sns`, which is the alias commonly used for seaborn. We create a scatterplot using `sns.scatterplot()`, passing it the name of the variable for the x-axis, the name of the variable for the y-axis, and the `msleep` DataFrame to the `data` argument. Finally, we call `plt.show()`.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt  # needed for plt.show()
sns.scatterplot(x="sleep_total", y="sleep_rem", data=msleep)
plt.show()

### Adding a trendline

We can add a linear trendline to the scatterplot using seaborn's `lmplot()` function. It takes the same arguments as `sns.scatterplot()`, but we'll set `ci` to `None` so that there aren't any confidence interval margins around the line. Trendlines like this make it easier to see a relationship between two variables.


In [None]:
import seaborn as sns
import matplotlib.pyplot as plt  # needed for plt.show()
sns.lmplot(x="sleep_total", y="sleep_rem", data=msleep, ci=None)
plt.show()

### Computing correlation

To calculate the correlation coefficient between two Series, we can use the `.corr()` method. If we want the correlation between the `sleep_total` and `sleep_rem` columns of `msleep`, we can take the `sleep_total` column and call `.corr()` on it, passing in the other Series we're interested in. Note that it doesn't matter which Series the method is invoked on and which is passed in, since the correlation between x and y is the same as the correlation between y and x.

In [None]:
msleep['sleep_total'].corr(msleep['sleep_rem']) 

In [None]:
msleep['sleep_rem'].corr(msleep['sleep_total']) 
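
As a self-contained check of that symmetry, the sketch below uses a tiny DataFrame whose values are made up for illustration (they are not taken from `msleep`). It also calls `DataFrame.corr()`, which returns the whole correlation matrix at once.

```python
import pandas as pd

# Toy values made up for illustration; they are not real measurements.
toy = pd.DataFrame({
    "x": [12.1, 17.0, 14.4, 14.9, 4.0, 10.1],
    "y": [2.0, 1.8, 2.4, 2.3, 0.7, 2.9],
})

r_xy = toy["x"].corr(toy["y"])  # x as the calling Series
r_yx = toy["y"].corr(toy["x"])  # y as the calling Series
corr_matrix = toy.corr()        # pairwise correlations for every numeric column

print(r_xy, r_yx)
print(corr_matrix)
```

The two calls agree, and the matrix is symmetric with 1s on the diagonal, because every variable is perfectly correlated with itself.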

### Many ways to calculate correlation

There's more than one way to calculate correlation. The method we've been using is the Pearson product-moment correlation, also written as r, which is the most commonly used measure. Mathematically, it is calculated from the products of each point's deviations from the means of x and y (x̄ and ȳ), scaled by the standard deviations of x and y (σx and σy). The formula itself isn't important to memorize, but know that there are other measures that quantify correlation a bit differently, such as Kendall's tau and Spearman's rho; those are beyond the scope of this course.
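
A minimal sketch of that calculation, assuming the usual sample form r = Σ(xᵢ − x̄)(yᵢ − ȳ) / ((n − 1)·σx·σy) with sample standard deviations. The normalisation is an assumption here, since the formula itself is not reproduced in these notes, and the data values are made up. The result is compared against pandas' `.corr()`.

```python
import pandas as pd

# Made-up values for illustration only.
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = pd.Series([2.1, 3.9, 6.2, 7.8, 10.1, 11.7])

n = len(x)
x_dev = x - x.mean()  # deviations from x-bar
y_dev = y - y.mean()  # deviations from y-bar

# Series.std() uses the sample standard deviation (ddof=1) by default,
# which matches the (n - 1) in the denominator.
r_manual = (x_dev * y_dev).sum() / ((n - 1) * x.std() * y.std())
r_pandas = x.corr(y)  # Pearson is the default method

print(r_manual, r_pandas)
```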

### Exercise: Relationships between variables

In this chapter, you'll be working with a dataset world_happiness containing results from the 2019 World Happiness Report. The report scores various countries based on how happy people in that country are. It also ranks each country on various societal aspects such as social support, freedom, corruption, and others. The dataset also includes the GDP per capita and life expectancy for each country.

In this exercise, you'll examine the relationship between a country's life expectancy (life_exp) and happiness score (happiness_score) both visually and quantitatively. seaborn is loaded as sns, matplotlib.pyplot as plt, and pandas as pd, and world_happiness is available.

In [None]:
# Create a scatterplot of happiness_score vs. life_exp
____

# Show plot
____

### Correlation caveats

Consider this data. There is clearly a relationship between x and y, but when we calculate the correlation, we get 0.18. This is because the relationship between the two variables is quadratic, not linear. The correlation coefficient measures the strength of linear relationships, and linear relationships only. Just like any summary statistic, correlation shouldn't be used blindly, and you should always visualize your data when possible.

In [None]:
df['x'].corr(df['y']) 
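
The same effect can be reproduced with synthetic data. In this sketch (the DataFrame `df_quad` is made up and is not the data described above), y is completely determined by x, yet the correlation is essentially zero because the relationship is a symmetric parabola rather than a line.

```python
import numpy as np
import pandas as pd

x = np.linspace(-3, 3, 101)  # symmetric around 0
df_quad = pd.DataFrame({"x": x, "y": x ** 2})  # y depends on x exactly, but not linearly

r_quad = df_quad["x"].corr(df_quad["y"])
print(f"correlation for a perfect quadratic relationship: {r_quad:.3f}")
```

The rising and falling halves of the parabola cancel each other out, so a linear measure sees no relationship at all.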


### Body weight vs. awake time

Here's a scatterplot of each mammal's body weight versus the time they spend awake each day. The relationship between these variables is definitely not a linear one. The correlation between body weight and awake time is only about 0.3, which is a weak linear relationship.

In [None]:
msleep['bodywt'].corr(msleep['awake'])

### Distribution of body weight

If we take a closer look at the distribution for bodywt, it's highly skewed. There are lots of lower weights and a few weights that are much higher than the rest. 

### Log transformation

When data is highly skewed like this, we can apply a log transformation. We'll create a new column called `log_bodywt`, which holds the log of each body weight. We can do this using `np.log()`. If we plot the log of body weight versus awake time, the relationship looks much more linear than the one between raw body weight and awake time. The correlation between the log of body weight and awake time is about 0.57, which is much higher than the 0.3 we had before.

In [None]:
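import numpy as np

# Store the natural log of each body weight in a new column
msleep['log_bodywt'] = np.log(msleep['bodywt'])

# Scatterplot with a linear trendline and no confidence band
sns.lmplot(x='log_bodywt',
           y='awake',
           data=msleep,
           ci=None)
plt.show()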

In [None]:
msleep['log_bodywt'].corr(msleep['awake']) 
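
For a version that runs without `msleep`, here is a sketch on synthetic data. All values are assumptions for illustration: the weights are spread evenly on a log scale so that the raw values are highly skewed, and `awake` is constructed to be linear in the log of weight plus a small deterministic wobble standing in for noise.

```python
import numpy as np
import pandas as pd

# Synthetic, made-up data: 60 weights from 0.01 to 1000, evenly spaced on a log scale.
weight = np.logspace(-2, 3, 60)
wobble = np.sin(np.arange(60))  # deterministic stand-in for noise

toy = pd.DataFrame({
    "bodywt": weight,
    "awake": 12 + np.log(weight) + wobble,  # linear in log(weight) by construction
})
toy["log_bodywt"] = np.log(toy["bodywt"])

r_raw = toy["bodywt"].corr(toy["awake"])
r_log = toy["log_bodywt"].corr(toy["awake"])

print(f"raw body weight vs. awake: r = {r_raw:.2f}")
print(f"log body weight vs. awake: r = {r_log:.2f}")
```

Because the relationship was built to be linear on the log scale, the log-transformed correlation is much higher than the raw one, which mirrors the jump seen with the real data.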

### Other transformations

In addition to the log transformation, there are lots of other transformations that can be used to make a relationship more linear, like taking the square root or reciprocal of a variable. The choice of transformation will depend on the data and how skewed it is. These can be applied in different combinations to x and y. For example, you could apply a log transformation to both x and y, or a square root transformation to x and a reciprocal transformation to y.
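
A short sketch of the square root and reciprocal transformations, using made-up data in which y is built to depend on 1/x or on √x. In each case, correlating y with the transformed x gives a stronger linear relationship than correlating it with the raw x.

```python
import numpy as np

x = np.linspace(1, 20, 50)

# Made-up relationships for illustration.
y_recip = 3 / x + 2          # y is linear in 1/x
y_sqrt = 4 * np.sqrt(x) - 1  # y is linear in sqrt(x)

r_recip_raw = np.corrcoef(x, y_recip)[0, 1]
r_recip_tx = np.corrcoef(1 / x, y_recip)[0, 1]       # reciprocal transformation of x
r_sqrt_raw = np.corrcoef(x, y_sqrt)[0, 1]
r_sqrt_tx = np.corrcoef(np.sqrt(x), y_sqrt)[0, 1]    # square root transformation of x

print(f"reciprocal: raw r = {r_recip_raw:.2f}, transformed r = {r_recip_tx:.2f}")
print(f"sqrt:       raw r = {r_sqrt_raw:.2f}, transformed r = {r_sqrt_tx:.2f}")
```

With real data the right transformation is rarely known in advance, so plotting the transformed variables is still the best way to judge whether the relationship has become more linear.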

### Why use a transformation?

So why use a transformation? Certain statistical methods rely on variables having a linear relationship, like calculating a correlation coefficient. Linear regression is another statistical technique that requires variables to be related in a linear manner, which you can learn all about in this course. 

### Correlation does not imply causation

Let's talk about one more important caveat of correlation that you may have heard about before: correlation does not imply causation. This means that if x and y are correlated, x doesn't necessarily cause y. For example, here's a scatterplot of the per capita margarine consumption in the US each year and the divorce rate in the state of Maine. The correlation between these two variables is 0.99, which is nearly perfect. However, this doesn't mean that consuming more margarine will cause more divorces. This kind of correlation is often called a spurious correlation.

### Confounding

A phenomenon called confounding can lead to spurious correlations. Let's say we want to know if drinking coffee causes lung cancer. Looking at the data, we find that coffee drinking and lung cancer are correlated, which may lead us to think that drinking more coffee will give you lung cancer. However, there is a third, hidden variable at play, which is smoking. Smoking is known to be associated with coffee consumption. It is also known that smoking causes lung cancer. In reality, it turns out that coffee does not cause lung cancer and is only associated with it, but it appeared causal due to the third variable, smoking. This third variable is called a confounder, or lurking variable. This means that the relationship of interest between coffee and lung cancer is a spurious correlation.

Another example of this is the relationship between holidays and retail sales. While it might be that people buy more around holidays as a way of celebrating, it's hard to tell how much of the increased sales is due to holidays, and how much is due to the special deals and promotions that often run around holidays. Here, special deals confound the relationship between holidays and sales.
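
Confounding can be illustrated with a small simulation. Every number below is an assumption chosen for illustration (the share of smokers, the coffee amounts, and the cancer risks are made up, not real epidemiological figures). Coffee has no effect on cancer in this simulated world: smoking alone raises both coffee consumption and cancer risk.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 20_000

# Made-up simulation parameters, for illustration only.
smoker = rng.random(n) < 0.3                      # 30% of people smoke
coffee = 1 + 2 * smoker + rng.normal(size=n)      # smokers drink more coffee on average
cancer = rng.random(n) < (0.05 + 0.25 * smoker)   # risk depends on smoking only

sim = pd.DataFrame({
    "smoker": smoker,
    "coffee": coffee,
    "cancer": cancer.astype(int),
})

# Ignoring smoking, coffee and cancer look related.
r_overall = sim["coffee"].corr(sim["cancer"])

# Holding smoking fixed, the relationship disappears.
r_within = {
    is_smoker: group["coffee"].corr(group["cancer"])
    for is_smoker, group in sim.groupby("smoker")
}

print(f"overall correlation: {r_overall:.2f}")
for is_smoker, r in r_within.items():
    print(f"smoker={is_smoker}: correlation = {r:.2f}")
```

The overall correlation is clearly positive, while within smokers and within non-smokers it is close to zero, which is the signature of a confounder at work.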