# Correlation


This chapter is adapted from Matthew Crump's excellent [Answering questions with data](https://crumplab.github.io/statistics/) book. The text has mainly been left intact with a few modifications, and the code has been adapted to use Python and Jupyter.



>"Correlation does not equal causation." ---Every statistics and research methods instructor ever




<span style="color:red">
    
### NOTE: this notebook loads data from a file called 'WHR2018.csv' that is in Brightspace / Content / Data /
    
</span>

In [None]:
import numpy.random as npr
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as stats
import pingouin as pg


#### Stats packages

The core Python library providing stats (and many other tools) is scipy. In addition, we will use pingouin, which is built on top of scipy and provides an easier-to-use interface for some stats tests, as well as sometimes providing more useful results output than scipy.

### Correlation background

##### Understanding the relationship between measurements

One of the fundamental ways we try to use data is to measure the relationship between measurements of different kinds. For example, does socioeconomic status have a relationship with academic outcomes?

When we have two variables, each with a number of matched observations, we can ask whether the two variables are **correlated**. 

Although correlation does not equal causation, causation should be reflected in correlation, and one of our main goals in Psychology is to use theory and data to understand the causal relationship between proposed constructs. In this section we will warm up with covariance, correlation, and how to assess these statistics in Python.

#### Charlie and the Chocolate factory

Imagine the following:

- a person's supply of chocolate has a causal influence on their level of happiness

- the more chocolate you have the more happy you will be, and the less chocolate you have, the less happy you will be

- because we suspect happiness is caused by lots of other things in a person's life, we anticipate that the relationship between chocolate supply and happiness won't be perfect

What do these assumptions mean for how the data should look?

Our first step is to collect some imaginary data from 100 people:  

1. how much chocolate do you have?
2. how happy are you?

For convenience, both the scales will go from 0 to 100. 

- chocolate: 0 means no chocolate, 100 means lifetime supply of chocolate. Any other number is somewhere in between. 

- happiness: 0 means no happiness, 100 means all of the happiness, and in between means some amount in between.

Here is some sample data from the first 10 imaginary subjects.

We will use several numpy (imported as np) functions to help us generate random behavior.

`np.arange()`  gives us a range of numbers between a start and a stop value:

In [None]:
subject = np.arange(1,101)
subject[1]

The numpy package includes a subpackage called `numpy.random`. We import that as `npr` to simplify accessing the random functions.

`import numpy.random as npr`


The `npr.uniform()` function gives us a random number(s) between a min and a max value, and all values are equally likely, or have _uniform_ probability of being sampled:

`npr.uniform(min, max, n_samples)`

If you leave the `n_samples` argument out you'll get one number.

In [None]:
# get three random numbers between 1 and 5
print(npr.uniform(1,5,3))

# get one random number between 1 and 5
print(npr.uniform(1,5))

In [None]:
# get 100,000 samples between .5 and 1
lots_of_random_numbers = npr.uniform(0.5, 1, 100000)

# make a histogram of the data
sns.displot(lots_of_random_numbers)

#### Note that whereas we previously used seaborn (sns) to plot data from dataframes, you can also use numpy arrays directly as the input the way that we did in this previous histogram example

The `np.round()` function takes in a number (or an array of numbers), and an optional argument for how many decimal places to round to. By default it rounds to zero decimals.

In [None]:
# round to no decimal places by leaving the decimals= argument out
np.round(1.2876)

In [None]:
# round to two decimal places
np.round(1.2876, decimals = 2)

Now we'll use some of those tools to make some fake data about chocolate consumption and happiness.

In [None]:
# make a set of participant numbers, 1-100:
participants = np.arange(1,101)

In [None]:
# get a bunch of 'chocolate' measurements by getting 
# a bunch of random numbers between .5 and 1 and multiplying them
# by a range of numbers going from 0-99.
#
# This will give us a set of increasing numbers with 
# some variability. the arange() gives us linear increase with no noise
# and multiplying times the uniform .5 to 1 gives some jitter to make
# the relationship not be a perfect correlation

# make some chocolate scores
chocolate = np.round(np.arange(100)*npr.uniform(0.5, 1, 100))

# plot the chocolate scores as y values vs their
# index position (np.arange(100))
sns.relplot(x=np.arange(100), y=chocolate)



In [None]:
# do the same thing to get some happiness data as well
happiness = np.round(np.arange(100)*npr.uniform(0.5, 1, 100))

In [None]:
# use a dictionary to make a dataframe with columns participant, chocolate, and happiness
# (the dataframe keeps its default 0-based integer index):
df_CC = pd.DataFrame({'participant': participants, 
                     'chocolate': chocolate, 
                     'happiness': happiness})

df_CC.head()

We asked each subject two questions so there are two scores for each subject, one for their chocolate supply, and one for their level of happiness. 

To look at the relationship between these scores we can plot them.

### Scatter plots

When you have two measurements' worth of data, you can always turn them into a scatter plot. 

A scatter plot has a horizontal x-axis, and a vertical y-axis. You choose which measurement goes on which axis. Let's put chocolate supply on the x-axis, and happiness level on the y-axis. 

We will use seaborn (imported as sns) scatterplot to look at the data

In [None]:
# note the use of method 'chaining' to 
# attach .set_title() to the end of the 
# scatterplot call
sns.scatterplot(x='chocolate', 
                y='happiness', 
                data=df_CC).set_title('Scatterplot of happiness versus chocolate')


Each dot is for one person: 100 people, 100 dots. 

Each dot has an x-coordinate for chocolate and a y-coordinate for happiness.

You can look at any dot, then draw a straight line down to the x-axis: that will tell you how much chocolate that subject has. You can draw a straight line left to the y-axis: that will tell you how much happiness the subject has.

In this plot happiness is lower for people with smaller supplies of chocolate, and higher for people with larger supplies of chocolate. This kind of relationship is called a **positive correlation**. 

### Positive, Negative, and No-Correlation

To simulate a negative relationship we will take advantage of an `np.arange()` feature that lets us get a range of numbers in reverse order by specifying the "step size" to be negative.

Numbers increasing from 0 to 99 in steps of 1 (the stop value itself is not included):
`np.arange(0,100)`

Numbers descending from 100 to 1 in steps of 1:
`np.arange(100,0,-1)`

In [None]:
np.arange(10,0,-.876)

The next cell has a bunch of code that simply makes a positive correlation, a negative correlation, and a random correlation.

In [None]:

# make a positive relationship like we already did
subject=np.arange(100)
chocolate=np.round(np.arange(100)*npr.uniform(0.5, 1, 100))
happiness=np.round(np.arange(100)*npr.uniform(0.5, 1, 100))
df_CC_pos= pd.DataFrame({'subject': subject, 
                         'chocolate': chocolate, 
                         'happiness': happiness})


# make a negative relationship by making the happiness data step
# from 100 down to 1 using arange() with step size -1
subject=np.arange(100)
chocolate=np.round(np.arange(100)*npr.uniform(0.5, 1, 100))
# use arange negative step size to go from 100 down to 0
happiness=np.round(np.arange(100,0,-1)*npr.uniform(0.5, 1, 100))
df_CC_neg= pd.DataFrame({'subject': subject, 
                         'chocolate': chocolate, 
                         'happiness': happiness})

# make a random relationship by just getting random numbers
# 0 to 100 for chocolate and happiness
subject=np.arange(100)
chocolate=np.round(npr.uniform(0, 100, 100))
happiness=np.round(npr.uniform(0, 100, 100))
df_CC_zero= pd.DataFrame({'subject': subject, 
                          'chocolate': chocolate, 
                          'happiness': happiness})


## Using matplotlib to make subplots of our pos, neg, and random correlations

In [None]:
# this first part uses matplotlib (plt) to set up a figure with three subplot panels
# the three subplot panels are the axes of the figure
# the 1,3 means 1 row of panels and three columns
fig, ax = plt.subplots(1, 3, 
                       figsize=(12, 4), 
                       sharex='all', 
                       sharey='all')

# The ax variable is a handle, or pointer, to the individual axes 
# (reminder: axes are the individual plots) and 
# gives us a way to put data into each of the three axes 
# when we make the plots


# tell seaborn to put the scatter plot into the first subplot by
# including the ax= argument to scatterplot(). the ax variable has three 
# entries, one for each subplot panel
sns.scatterplot(x='chocolate', 
                y='happiness', 
                data=df_CC_pos, 
                ax=ax[0]).set_title('positive')

# put the negative correlation into the second (middle) panel (ax[1])
sns.scatterplot(x='chocolate', 
                y='happiness', 
                data=df_CC_neg, 
                ax=ax[1]).set_title('negative')

# put the random (no correlation) data into the third (right) panel (ax[2])
sns.scatterplot(x='chocolate', 
                y='happiness', 
                data=df_CC_zero,
                ax=ax[2]).set_title('random')


## Pearson's r

In past weeks we did descriptive statistics for a single measure, like chocolate, or happiness (i.e., means, variances, etc.).

A statistic that summarizes the relationship between two variables is "Pearson's $r$", and that's the r value that specifies the strength of a correlation.


Correlation scores range between -1 (perfect negative correlation) and +1 (perfect positive correlation), with 0 indicating no linear relationship.
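
As a quick sanity check on those limits, here is a minimal sketch using made-up numbers (the arrays below are hypothetical and not part of the chapter's data). A variable that rises in a perfect straight line with another should give an r of about +1, and one that falls in a perfect straight line should give an r of about -1. The sketch uses numpy's `np.corrcoef()`, which returns a correlation matrix; the off-diagonal entry is the r value.

```python
import numpy as np

# hypothetical data: x counts up from 0 to 9
x = np.arange(10)

# y_up increases in a perfect straight line with x
y_up = 2 * x + 1
# y_down decreases in a perfect straight line with x
y_down = -3 * x + 50

# np.corrcoef returns a 2x2 matrix; entry [0, 1] is r for the pair
r_up = np.corrcoef(x, y_up)[0, 1]
r_down = np.corrcoef(x, y_down)[0, 1]

print(r_up, r_down)
```

Because of floating point rounding the printed values may differ from exactly 1 and -1 in the last decimal place.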

Let's take a look at a formula for Pearson's $r$:

$r = \frac{cov(X,Y)}{\sigma_{X}\sigma_{Y}} = \frac{cov(X,Y)}{SD_{X}SD_{Y}}$

$\sigma$ is often used as a symbol for the standard deviation (SD). In words, $r$ is the co-variance of X and Y, divided by the product of the standard deviation of X and the standard deviation of Y. This operation has the effect of **normalizing** the co-variance into the range -1 to 1. 

The formula for the co-variance is:

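$cov(X,Y) = \frac{\sum_{i=1}^{N}(x_{i}-\bar{X})(y_{i}-\bar{Y})}{N}$

Here $N$ is the number of matched observations, and $\bar{X}$ and $\bar{Y}$ are the means of X and Y.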



> It's worth saying that there are other formulas for computing Pearson's $r$. You can find them by Googling them. They will all give the same answer but vary in how it's calculated 
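
To make the formula concrete, here is a small sketch that computes the co-variance and Pearson's $r$ step by step with numpy and then compares the answer with numpy's built-in `np.corrcoef()`. The five x and y values are invented for illustration. Note that `np.std()` divides by $N$ by default, which matches the co-variance formula used here.

```python
import numpy as np

# small made-up example data (hypothetical values, not from the chapter)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 5.0])

n = len(x)

# co-variance: sum of the products of each pair's deviations
# from their means, divided by N
cov_xy = np.sum((x - x.mean()) * (y - y.mean())) / n

# divide by the product of the two standard deviations
# (np.std divides by N by default, matching the formula)
r_by_hand = cov_xy / (np.std(x) * np.std(y))

# compare with numpy's built-in correlation
r_numpy = np.corrcoef(x, y)[0, 1]

print(cov_xy, r_by_hand, r_numpy)
```

The two r values should agree up to floating point rounding.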

## Examples with Data

Let's look at some data from the [world happiness report](http://worldhappiness.report) (2018).

Data are in a csv file ('WHR2018.csv') that should be placed in the same folder as this notebook in order to read it into pandas. Or, make sure you include the **path** to the file along with the filename if you save it somewhere else.


This report measured various attitudes across people from different countries. 

For example, one question asked about how much freedom people thought they had to make life choices. Another question asked how confident people were in their national government. 

Here is a scatterplot showing the relationship between these two measures. Each dot represents means for different countries.

In [None]:
# use pandas to read the csv file and store it as a dataframe in the variable whr_df
# (change the path below to match wherever you saved WHR2018.csv)
whr_df = pd.read_csv('../../../data/stats_data/WHR2018.csv', sep = ',')
whr_df.head()

In [None]:
# take a subset of the columns in the original dataframe, 
# and use .dropna() to get rid of any rows 
# in the dataframe that are missing data for some measurement

# list of column names we want to keep in the df
cols_to_keep = ['country', 
                'Freedom to make life choices', 
                'Confidence in national government']

# make a smaller dataframe using only our
# desired columns
# We simply pass a list of columns we want
smaller_df = whr_df[cols_to_keep].copy()
smaller_df.shape

In [None]:
# use the dataframe dropna() function with argument inplace=True 
# to do the change in the existing dataframe (rather than outputting to a new variable):
smaller_df.dropna(inplace=True)
smaller_df.shape

In [None]:
# use seaborn regplot() to make a scatterplot but also 
# include the regression line showing the best fitting
# line mapping x values to y values
# regplot() follows the conventional seaborn structure
# where we can pass dataframe data by specifying column names
# for x= and y= and then point data= at the dataframe variable

# the line_kws and scatter_kws are options to change the color of the
# regression line to blue and change the dots to size = 2 and color = black
sns.regplot(x='Freedom to make life choices', 
            y='Confidence in national government', 
            data=smaller_df,
            line_kws={'color':'steelblue'}, 
            scatter_kws={'s': 2, 'color': 'black'})


We put a blue line on the scatterplot to summarize the positive relationship.

The actual correlation, as measured by Pearson's $r$, can be obtained using pingouin or scipy.

### Correlation using pingouin

In [None]:
# use pingouin (imported as pg) to compute the correlation
# it takes in a set of x values, a set of y values, and the type of correlation
corr_results = pg.corr(x=smaller_df['Freedom to make life choices'], 
                       y=smaller_df['Confidence in national government'], 
                       method='pearson')

corr_results['r'].values[0]

Pingouin's `corr()` function returns a dataframe:

In [None]:
print(type(corr_results))

In [None]:
corr_results

In [None]:
# to access a column of the dataframe we simply do
# corr_results[column_name]
#
# doing that will return a pandas Series
# To get access to the actual values we do:
# df[column_name].values
# and that will give us a numpy array
# which we can index like a list:
print(corr_results['r'].values)
print(corr_results['r'].values[0])

In [None]:
# we can dynamically populate a little stats reporting text like this:

# get the degrees of freedom by getting sample size minus 2
# (we use .values[0] because the results dataframe is indexed
# by the method name, not by 0)
deg_free = corr_results['n'].values[0]-2

# get the r value
r = corr_results['r'].values[0]
# round it to two decimal places
r = np.round(r,2)

# get the p value
p = corr_results['p-val'].values[0]

# the p-value for this test is super small (4.1e-57)
# this is scientific notation and means move the decimal point
# 57 places to the left!

# let's check the magnitude of the p value and adjust our stats reporting accordingly:
if p < .0001:
    p_text = 'p < .0001'
    sig = True

elif p < .001:
    p_text = 'p < .001'
    sig = True
elif p < .05:
    p_text = 'p < .05'
    sig = True
    
# if p value is not significant at one of the levels
# we can just print it out, rounded to two places:
else:
    p_text = f'p = {np.round(p, 2)}'
    sig = False


# report the correlation based on whatever the results were
var1 = 'Freedom to make life choices'
var2 = 'Confidence in national government'
if sig:
    stats_text = f'The correlation between {var1} and {var2} was significant (r({deg_free}) = {r}, {p_text}).'
else:
    stats_text = f'The correlation between {var1} and {var2} was not significant (r({deg_free}) = {r}, {p_text}).'


print(stats_text)

Here is something to keep in mind when we think about correlation and the relationship between variables. Looking at the graph you might start to wonder: Does freedom to make life choices cause changes in how confident people are in their national government? Or does it work the other way? Does being confident in your national government give you a greater sense of freedom to make life choices? Or, is this just a random relationship that doesn't mean anything? All good questions. These data do not provide the answers, they just suggest a possible relationship.

### Correlation using scipy


The scipy.stats library includes a `pearsonr()` function that takes in two sets of numbers and computes Pearson's r for them.

It returns a result that behaves like a "tuple", which is like a list. The first entry is the r value and the second is the p value.

In [None]:


scipy_corr = stats.pearsonr(smaller_df['Freedom to make life choices'], 
                            smaller_df['Confidence in national government'])

# get the r value
print(scipy_corr[0])

# get the p value
print(scipy_corr[1])

A nice thing about pingouin vs scipy is that pingouin gives you some additional info with the p and r values.

## Interpreting Correlations

- #### Correlation does not equal causation

- #### And even when there is causation there might not be an obvious correlation

Consider buying a snake plant for your home. Like most plants, snake plants need some water to stay alive. However, they also need just the right amount of water. 

Imagine an experiment where 1000 snake plants were grown in a house. Each snake plant is given a different amount of water per day, from zero teaspoons of water per day to 1000 teaspoons of water per day. 

We will assume that water is part of the causal process that allows snake plants to grow. 

The amount of water given to each snake plant per day can be one of our measures. Every week the experimenter measures snake plant growth, which will be the second measurement. 

Now, can you imagine for yourself what a scatter plot of weekly snake plant growth by teaspoons of water would look like?

The first plant given no water at all would have a very hard time and eventually die. It should have the least amount of weekly growth. 

The plants given only a few teaspoons of water per day could get just enough water to keep the plants alive, so they will grow a little bit but not a lot. 

As we look at snake plants getting more and more water, we should see more and more plant growth, but only up to a point. Too much water can be bad for these plants. 

Data like this will produce a scatter plot with an upside down U shape.

Computing Pearson's $r$ for data like this can give you $r$ values close to zero. The scatter plot could look something like this:

In [None]:
# np.linspace() takes a start and stop value and the number of desired steps
# and returns to you a set of numbers
# here we are asking for 1000 evenly spaced values between 0 and 1000
water = np.linspace(0,1000,1000)

# concatenate a range of numbers going from 0 to 10 and then a set going from 10 to 0
growth = np.concatenate(
    (np.linspace(0,10,500), 
     np.linspace(10,0,500)), 
    axis=None)

# randomly choose 1000 values from uniform distribution between -2 and 2
noise = npr.uniform(-2,2,1000)

# add some "noise" or randomness to the growth variable
growth = growth+noise

# make dataframe from dictionary:
snake_df = pd.DataFrame({"growth": growth, 
                         "water": water})

sns.scatterplot(x='water', 
                y='growth',
                data=snake_df).set_title('Imaginary snake plant growth as a function of water')


There is clearly a relationship between watering and snake plant growth. But the relationship doesn't go in one direction: growth rises with water at first and then falls. As a result, when we compute the correlation in terms of Pearson's r, we get a value suggesting no relationship.

In [None]:
corr_stats = pg.corr(snake_df['growth'], snake_df['water'])


corr_stats

We have an r value close to 0, and a p value that is usually well above .05 (the exact numbers change on every run because the noise is random).

There is no linear relationship that can be described by a single straight line. When we need lines or curves going in more than one direction, we have a nonlinear relationship.

This example illustrates some conundrums in interpreting correlations. 

> Pro Tip: This is one reason why plotting your data is so important. If you see an upside-down U-shaped pattern, then a correlation analysis is probably not the best analysis for your data.
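
One way to see that the relationship is real even though the overall $r$ is near zero is to compute the correlation separately for the rising half and the falling half of the data. The sketch below rebuilds the imaginary snake plant data the same way as above; the fixed random seed and the choice to split at the midpoint are assumptions made only for this illustration.

```python
import numpy as np
import numpy.random as npr
import scipy.stats as stats

# fixed seed so the sketch is repeatable (an assumption, not in the chapter)
npr.seed(1)

# rebuild the imaginary snake plant data
water = np.linspace(0, 1000, 1000)
growth = np.concatenate((np.linspace(0, 10, 500),
                         np.linspace(10, 0, 500)))
growth = growth + npr.uniform(-2, 2, 1000)

# correlation across all 1000 plants
r_all, p_all = stats.pearsonr(water, growth)

# correlation within the first 500 plants (growth rising with water)
r_rising, p_rising = stats.pearsonr(water[:500], growth[:500])

# correlation within the last 500 plants (growth falling with water)
r_falling, p_falling = stats.pearsonr(water[500:], growth[500:])

print(r_all, r_rising, r_falling)
```

We would expect a strong positive $r$ for the rising half and a strong negative $r$ for the falling half. Across the whole data set the two halves cancel out, which is why the overall $r$ lands near zero.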

### Correlation and Random chance

Another very important aspect of correlations is the fact that they can be produced by random chance. This means that you can find a positive or negative correlation between two measures, even when they have absolutely nothing to do with one another. These are **spurious** correlations, produced just by chance alone. 

Imagine a situation with no causal connection:

- two participants
- one at the north pole with a lottery machine full of balls numbered from 1 to 10
- one at the south pole with a similar machine
- each participant randomly chooses 10 balls and records the number


### Simulating a bunch of random correlation analyses:
Here is what the numbers on each ball could look like for each participant:

In [None]:
# randomly choose 10 numbers between 1 and 10
north_pole = np.round(npr.uniform(1,10,10))

# choose 10 more random numbers between 1 and 10
south_pole = np.round(npr.uniform(1,10,10))

# make dataframe from dictionary
df_poles = pd.DataFrame({'north_pole': north_pole, 
                         'south_pole': south_pole})

df_poles.head()

In [None]:
# use pingouin to compute the correlation
results = pg.corr(x=df_poles['north_pole'], 
                  y=df_poles['south_pole'])

# print the pearson's r value
results['r'].values[0]

In this one case, Pearson's $r$ is the value printed above (your value will differ, because the numbers are drawn at random).

But, we know that relationship should be completely random, because that is how we set up the game.

The question is what can random chance do? If we ran our game over and over again thousands of times, each time choosing new balls, and each time computing the correlation, what would we find?

First, we will find fluctuation. The r value will sometimes be positive, sometimes be negative, sometimes be big and sometimes be small. 

Second, we will see what the fluctuation looks like. This will give us a window into the kinds of correlations that chance alone can produce. Let's see what happens.

In [None]:
# empty list to append to for 
# keeping track of simulated
# results
simulated_correlations = []

# do 1000 simulations of randomly drawing
# two sets of ten numbers and computing the correlation
for sim in range(1000):
    north_pole = np.round(npr.uniform(1,10,10))
    south_pole = np.round(npr.uniform(1,10,10))

    # use scipy.stats pearson r
    r, p = stats.pearsonr(north_pole,south_pole)
    
    # keep track of the r value for this loop of the simulation
    simulated_correlations.append(r)


In [None]:
# use seaborn scatterplot, passing arrays of numbers to 
# x and y arguments (not using dataframe)
# we will also keep track of the handle or pointer
# to the plot in variable pl. This will be used to set labels
# for the x and y axis

# plot the correlation values against an index of 0-999
pl = sns.relplot(x=np.arange(1000), 
            y=simulated_correlations)

# add labels to x and y axis
# we have to do this because we didn't
# use a dataframe for the relplot() above
pl.set_axis_labels('sims', 'simulated correlation value')

# use matplotlib to add green lines at y position 1 and
# y position -1

# the first input is x values (0, 1000) and the second
# is y values
# plt.plot() makes a line connecting points, so we are making
# a line that connects the x,y point= (0, 1) to the x,y point (1000, 1)
# and another line that connects (0, -1) and (1000, -1)
plt.plot([0, 1000], [1, 1], linewidth=2, color='green')
plt.plot([0, 1000], [-1, -1], linewidth=2, color='green')



Each dot in the scatter plot shows the Pearson $r$ for one simulation, from 1 to 1000. All the dots fall within the range -1 to 1.

We can also look at the distribution of r values we got in our simulations:

In [None]:
# seaborn displot() makes a histogram
# simulated correlations is an array of r values
sns.displot(simulated_correlations)


The distribution plot makes clear that the bulk of our simulated correlations were close to zero, and there is increasingly smaller chance of randomly observing large positive or negative r values. The significance value of a correlation is related to these distributions, but note that the more correlations you run, the more chance you have to find a false positive result.

The important lesson here is that random chance produced all of these correlations. This means we can find "correlations" in the data that are completely meaningless, and do not reflect any causal relationship between one measure and another.
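
To put a rough number on the false positive point, here is a sketch that repeats the north pole and south pole simulation and counts how often the p value falls below .05, even though the two sets of numbers are unrelated. The fixed random seed and the names `n_sims`, `p_values`, and `proportion_sig` are choices made for this illustration. With a .05 threshold we would expect roughly 5% of the simulated correlations to come out "significant" by chance alone.

```python
import numpy as np
import numpy.random as npr
import scipy.stats as stats

# fixed seed so the sketch is repeatable (an assumption, not in the chapter)
npr.seed(2)

n_sims = 1000
p_values = []

for sim in range(n_sims):
    # two unrelated sets of ten random numbers, as in the game above
    north_pole = np.round(npr.uniform(1, 10, 10))
    south_pole = np.round(npr.uniform(1, 10, 10))

    # keep the p value from each simulated correlation
    r, p = stats.pearsonr(north_pole, south_pole)
    p_values.append(p)

# proportion of simulations with p < .05 (false positives)
proportion_sig = np.mean(np.array(p_values) < 0.05)
print(proportion_sig)
```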

### Summary

Correlation and the closely related scatterplot are useful tools for examining the relationship between two variables with matched observations.

We can easily compute correlations using scipy.stats:

```
import scipy.stats as stats
x = [1, 2, 3, 4, 5]  # some numbers
y = [2, 1, 4, 3, 5]  # some more numbers, matched with x

r,p = stats.pearsonr(x, y)
```

Or using Pingouin:

```
# using pingouin with arrays of numbers
import pingouin as pg
x = [1, 2, 3, 4, 5]  # some numbers
y = [2, 1, 4, 3, 5]  # some more numbers, matched with x

results = pg.corr(x, y, method='pearson')

results['r'].values[0]
results['p-val'].values[0]
```

```
# using pingouin with a dataframe
import pandas as pd
import pingouin as pg
x = [1, 2, 3, 4, 5]  # some numbers
y = [2, 1, 4, 3, 5]  # some more numbers, matched with x

df = pd.DataFrame({'var1': x, 'var2': y})

results = pg.corr(x=df['var1'], y=df['var2'], method='pearson')

results['r'].values[0]
results['p-val'].values[0]
```




### Other kinds of correlation

This is of course not a complete walkthrough of all things correlation in Python. In particular, we only looked at computing Pearson's r, and there are other forms of correlation that one can compute. Pingouin provides easy access to these other calculations using the `method=` argument with one of the following:


Correlation type:

'pearson': Pearson 𝑟 product-moment correlation

'spearman': Spearman 𝜌 rank-order correlation

'kendall': Kendall’s 𝜏𝐵 correlation (for ordinal data)

'bicor': Biweight midcorrelation (robust)

'percbend': Percentage bend correlation (robust)

'shepherd': Shepherd’s pi correlation (robust)

'skipped': Skipped correlation (robust)

For scipy these different kinds of correlation can be computed using different functions:

`scipy.stats.spearmanr`

`scipy.stats.kendalltau`
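
As a brief illustration of why you might reach for one of these, the sketch below compares Pearson's r with the Spearman and Kendall rank correlations from scipy on made-up data where y always increases with x, but not in a straight line. The data (y equal to x cubed) are hypothetical and chosen only to make the contrast visible. Because the rank-based measures only care about ordering, we would expect them to report a perfect relationship while Pearson's r falls short of 1.

```python
import numpy as np
import scipy.stats as stats

# hypothetical data: y always increases with x, but along a curve
x = np.arange(1, 11)
y = x ** 3

# Pearson's r measures how well a straight line describes the data
r, p_r = stats.pearsonr(x, y)

# Spearman's rho and Kendall's tau are based on rank order
rho, p_rho = stats.spearmanr(x, y)
tau, p_tau = stats.kendalltau(x, y)

print(r, rho, tau)
```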