## Aims

This course will teach you about calculating correlations and significance testing, and introduce some ways to visualise the results.


Prior knowledge of Python, NumPy, Pandas, Iris, and Matplotlib is assumed for this course.

## Table of Contents

* [Correlation](#correlation)
* [Correlation Coefficient](#correlation_coefficient)
* [Significance Testing](#significance_testing)
* [Examples using SciPy](#examples_scipy)
* [Examples using NumPy](#examples_numpy)
* [Examples using Pandas](#examples_pandas)
* [Correlation Coefficient Matrices](#correlation_matrix)
* [Examples using Iris](#examples_iris)
* [Exercise 1](#exercise_1)


## Correlation<a class="anchor" id="correlation"></a>

Measures of correlation describe the relationship between two datasets of paired values.

**Positive correlation** exists when the values in one dataset increase as the values in the other dataset increase, or when the values in one dataset decrease as the values in the other dataset decrease (the elements in the two datasets change in the same direction).

**Negative correlation** exists when the values in one dataset decrease as the values in the other dataset increase (the elements in the two datasets change in opposite directions).

**No correlation** exists between two datasets that show no linear relationship to each other. 

In [None]:
import matplotlib.pyplot as plt


fig, (ax1, ax2, ax3) = plt.subplots(3, figsize=(5, 15))

# One dataset falls as the other rises
dataneg1 = [5, 9, 7, 6, 4, 3, 2]
dataneg2 = [5, 2, 3, 4, 5, 6, 7]

ax1.scatter(dataneg1, dataneg2)
ax1.set_title('Negative Correlation')

# No clear linear relationship
datano1 = [5, 5, 4, 6, 3, 6, 7]
datano2 = [5, 6, 7, 6, 4, 8, 4]

ax2.scatter(datano1, datano2)
ax2.set_title('No Correlation')

# Both datasets rise and fall together
datapos1 = [5, 9, 7, 6, 4, 3, 2]
datapos2 = [4, 7, 5, 4, 3, 2, 1]

ax3.scatter(datapos1, datapos2)
ax3.set_title('Positive Correlation')

for ax in [ax1, ax2, ax3]:
    ax.set_xlim(1,10)
    ax.set_ylim(1,10)
    ax.set_xticks([1,2,3,4,5,6,7,8,9,10])
    ax.set_xticklabels([1,2,3,4,5,6,7,8,9,10])
    
fig.tight_layout()

## Correlation coefficient<a class="anchor" id="correlation_coefficient"></a>

The correlation coefficient is the measure of the relationship between the two datasets. The correlation coefficient can range from -1 to 1, where:
- values less than 0 denote a negative correlation, 
- values greater than 0 denote a positive correlation and 
- 0 itself denotes no correlation. 

The closer the correlation coefficient is to either 1 or -1, the stronger the correlation; the closer it is to 0, the weaker the correlation.
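To make the coefficient less of a black box, the sketch below computes Pearson's correlation coefficient directly from its definition and cross-checks it against `np.corrcoef`. The variable names `dev1`, `dev2`, `r_manual` and `r_numpy` are illustrative choices, and the sample values are the ones used in the library examples later in this course.

```python
import numpy as np

sample1 = np.array([2, 5, 4, 23, 4, 5, 15, 1], dtype=float)
sample2 = np.array([1, 3, 5, 20, 2, 6, 12, 2], dtype=float)

# Deviation of each value from the mean of its own dataset
dev1 = sample1 - sample1.mean()
dev2 = sample2 - sample2.mean()

# Pearson's r: the sum of the products of the deviations, normalised by the
# square root of the product of the summed squared deviations
# (equivalent to the covariance divided by the product of the standard deviations)
r_manual = np.sum(dev1 * dev2) / np.sqrt(np.sum(dev1**2) * np.sum(dev2**2))

# Cross-check against NumPy's built-in calculation
r_numpy = np.corrcoef(sample1, sample2)[0, 1]

print(f'Manual r = {r_manual:.4f}')
print(f'NumPy r  = {r_numpy:.4f}')
```

The two values should agree to floating-point precision, and the normalisation is what keeps the result between -1 and 1.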

The correlation between datasets can be calculated using **Pearson's correlation**, which measures the strength of a linear relationship and is appropriate when the data are normally distributed, or **Spearman's correlation**, which is calculated from the ranks of the values and is appropriate when the relationship is non-linear (but still consistently increasing or decreasing) or the data are not normally distributed. Spearman's correlation may also be used for linearly related data, but the computed correlations may be weaker. If the distribution of the data or the possible relationships are not known, use Spearman's correlation.
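To illustrate the difference between the two coefficients, the sketch below uses made-up data (an assumption for illustration, not course data) in which `y` always increases with `x`, but along a curve rather than a straight line.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Illustrative data: y rises with x every time, but not linearly
x = np.arange(1, 11)
y = x ** 3

pearson_r, _ = pearsonr(x, y)
spearman_r, _ = spearmanr(x, y)

print(f"Pearson's r  = {pearson_r:.3f}")
print(f"Spearman's r = {spearman_r:.3f}")
```

Because the ranks of `x` and `y` agree exactly, Spearman's coefficient is 1 (to floating-point precision), while Pearson's coefficient is below 1 because the points do not lie on a straight line.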

## Significance testing<a class="anchor" id="significance_testing"></a>

The correlation coefficient quantifies the relationship between two datasets, but the datasets are samples of the total population, and if a different sample were drawn from the population, the relationship could be different. To identify whether the correlation between datasets is likely to be indicative of the population and not just the sample, significance testing must be performed.

The **p-value** is the probability of obtaining a correlation at least as strong as the one observed in the sample datasets if the null hypothesis (that there is no correlation in the population) were true. The p-value is evaluated against a **significance level**, which is typically 0.01, 0.05 or 0.1:

- A p-value of 0.01 or below means that we can reject the null hypothesis and conclude that the correlation is statistically significant at the 99% confidence level (if the null hypothesis were true, a correlation this strong would arise in at most 1% of samples).
- A p-value of 0.05 or below means that we can reject the null hypothesis and conclude that the correlation is statistically significant at the 95% confidence level (if the null hypothesis were true, a correlation this strong would arise in at most 5% of samples).
- A p-value of 0.1 or below means that we can reject the null hypothesis and conclude that the correlation is statistically significant at the 90% confidence level (if the null hypothesis were true, a correlation this strong would arise in at most 10% of samples).
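One way to build intuition for the p-value is a permutation test: shuffling one dataset breaks any real pairing, so the shuffled correlations show what chance alone produces. The sketch below is an illustration under assumptions, not the method SciPy uses; the names `n_permutations` and `count_as_extreme` and the fixed random seed are arbitrary choices.

```python
import numpy as np
from scipy.stats import pearsonr

sample1 = np.array([2, 5, 4, 23, 4, 5, 15, 1])
sample2 = np.array([1, 3, 5, 20, 2, 6, 12, 2])

# Observed correlation and SciPy's analytic p-value for comparison
observed_r, analytic_pval = pearsonr(sample1, sample2)

rng = np.random.default_rng(seed=0)  # fixed seed so the sketch is repeatable
n_permutations = 10000
count_as_extreme = 0

for _ in range(n_permutations):
    # Shuffling sample2 simulates the null hypothesis of no relationship
    shuffled = rng.permutation(sample2)
    permuted_r = np.corrcoef(sample1, shuffled)[0, 1]
    if abs(permuted_r) >= abs(observed_r):
        count_as_extreme += 1

# Fraction of shuffles giving a correlation at least as strong as the observed one
permutation_pval = count_as_extreme / n_permutations

print(f'Observed r          = {observed_r:.3f}')
print(f'Permutation p-value = {permutation_pval:.4f}')
print(f'Analytic p-value    = {analytic_pval:.6f}')
```

The permutation and analytic p-values need not agree closely for a sample this small and skewed, because the analytic value relies on an assumption of normally distributed data.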



## Examples using the SciPy library<a class="anchor" id="examples_scipy"></a>

In [None]:
from scipy.stats import pearsonr
from scipy.stats import spearmanr


sample1 = [2, 5, 4, 23, 4, 5, 15, 1]
sample2 = [1, 3, 5, 20, 2, 6, 12, 2]

# Pearson's correlation
coefficient, pval = pearsonr(sample1, sample2)
print(f"Pearson's Correlation coefficient = {coefficient}")
print(f'P-value = {pval}')

# Spearman's correlation
coefficient, pval = spearmanr(sample1, sample2)
print(f"Spearman's Correlation coefficient = {coefficient}")
print(f'P-value = {pval}')

print('This is a strong positive correlation that is significant at the 99% confidence level')

<font color='red'>**NOTE**</font>:  It is good practice to **always** plot the data because it is possible to get high correlations when no relationship exists and low correlations when a relationship does exist, e.g. see below.

In [None]:
import matplotlib.pyplot as plt
from scipy.stats import pearsonr
from scipy.stats import spearmanr

fig, (ax1, ax2) = plt.subplots(2)

sample1 = [5, 5, 5, 5, 5, 5, 15]
sample2 = [5, 2, 3, 4, 5, 6, 7]
coefficient,pval = pearsonr(sample1,sample2)

ax1.scatter(sample1, sample2)
ax1.set_title(f'Correlated ({coefficient:.2f})')

sample3 = [1, 2, 3, 4, 5, 6, 7]
sample4 = [1, 4, 8, 16, 8, 4, 1]
coefficient,pval = pearsonr(sample3,sample4)

ax2.scatter(sample3, sample4)
ax2.set_title(f'Not correlated ({coefficient:.2f})')
    
fig.tight_layout()

## Examples using the NumPy library<a class="anchor" id="examples_numpy"></a>

In [None]:
import numpy as np

# Pearson's correlation
sample1 = [2, 5, 4, 23, 4, 5, 15, 1]
sample2 = [1, 3, 5, 20, 2, 6, 12, 2]
coefficient = np.corrcoef(sample1, sample2)
print('Correlation coefficient =')
print(f'{coefficient}\n')

# Note that this returns the correlation coefficient matrix, meaning that the value in
# the upper-left is the correlation between sample1 and sample1
# the upper-right is the correlation between sample1 and sample2
# the lower-left is the correlation between sample2 and sample1
# the lower-right is the correlation between sample2 and sample2
# so to extract the correlation we are interested in, this must be specified:

print(f'Correlation coefficient = {coefficient[0,1]}')

# Note that NumPy does not return the p-value
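Although NumPy does not return a p-value, one can be derived from the coefficient and the sample size. The sketch below uses the standard t-test for a Pearson correlation, which carries the same normality assumption as `pearsonr`, and cross-checks the result against SciPy. The names `t_stat` and `scipy_pval` are illustrative.

```python
import numpy as np
from scipy import stats

sample1 = [2, 5, 4, 23, 4, 5, 15, 1]
sample2 = [1, 3, 5, 20, 2, 6, 12, 2]

r = np.corrcoef(sample1, sample2)[0, 1]
n = len(sample1)

# Test statistic for the null hypothesis of zero correlation
t_stat = r * np.sqrt((n - 2) / (1 - r**2))

# Two-sided p-value from Student's t-distribution with n - 2 degrees of freedom
pval = 2 * stats.t.sf(abs(t_stat), df=n - 2)

# Cross-check against the p-value reported by SciPy
_, scipy_pval = stats.pearsonr(sample1, sample2)

print(f'P-value from NumPy coefficient = {pval}')
print(f'P-value from SciPy             = {scipy_pval}')
```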


## Examples using the Pandas library<a class="anchor" id="examples_pandas"></a>

In [None]:
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

# Pearson's correlation
sample1 = [2, 5, 4, 23, 4, 5, 15, 1]
sample2 = [1, 3, 5, 20, 2, 6, 12, 2]
sample_df = pd.DataFrame(list(zip(sample1, sample2)), columns=['sample1', 'sample2'])
coefficient = sample_df.sample1.corr(sample_df.sample2)
print(coefficient)

# Note that Pandas does not return the p-value directly. One way to obtain it is to
# pass a function that returns SciPy's p-value to corr(). Pandas sets the diagonal
# to 1, so the identity matrix is subtracted to show 0 on the diagonal instead.
pval = sample_df.corr(method=lambda x, y: pearsonr(x, y)[1]) - np.eye(len(sample_df.columns))
print(f'P-value = \n{pval}')

print('This is a strong positive correlation that is significant at the 99% confidence level')
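Pandas can also compute Spearman's correlation through the `method` argument of `corr`. The short sketch below cross-checks the Pandas result against `scipy.stats.spearmanr`; the names `spearman_pandas` and `spearman_scipy` are illustrative.

```python
import pandas as pd
from scipy.stats import spearmanr

sample_df = pd.DataFrame({
    'sample1': [2, 5, 4, 23, 4, 5, 15, 1],
    'sample2': [1, 3, 5, 20, 2, 6, 12, 2],
})

# Spearman's correlation between two columns using Pandas
spearman_pandas = sample_df['sample1'].corr(sample_df['sample2'], method='spearman')

# The same calculation with SciPy, which also returns the p-value
spearman_scipy, spearman_pval = spearmanr(sample_df['sample1'], sample_df['sample2'])

print(f"Spearman's correlation (Pandas) = {spearman_pandas}")
print(f"Spearman's correlation (SciPy)  = {spearman_scipy}")
print(f'P-value (SciPy) = {spearman_pval}')
```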

## Correlation Coefficient Matrices<a class="anchor" id="correlation_matrix"></a>
It can be useful to compute the correlations between all combinations of more than two datasets and visualise the correlations. The correlation matrix can be generated using `pandas` and then visualised using a Python library called `seaborn`. Another method of visualising the relationships (including correlations) is also demonstrated.

In [None]:
import seaborn as sn
import pandas as pd

sample1 = [2, 5, 4, 23, 4, 5, 15, 1]
sample2 = [1, 3, 5, 20, 2, 6, 12, 2]
sample3 = [3, 3, 6, 26, 4, 8, 13, 1]
sample_df = pd.DataFrame(list(zip(sample1, sample2, sample3)), columns=['sample1', 'sample2', 'sample3'])
correlation_matrix = sample_df.corr()
sn.heatmap(correlation_matrix, annot=True, cmap='BuPu')

In [None]:
import seaborn as sn
import pandas as pd
from scipy.stats import pearsonr
import matplotlib.pyplot as plt

def reg_coef(x,y,label=None,color=None,**kwargs):
    ax = plt.gca()
    corr,pval = pearsonr(x,y)
    ax.annotate('Correlation\n{:.2f}'.format(corr), xy=(0.5,0.5), xycoords='axes fraction', ha='center')
    ax.set_axis_off()

sample1 = [2, 5, 4, 23, 4, 5, 15, 1]
sample2 = [1, 3, 5, 20, 2, 6, 12, 2]
sample3 = [3, 3, 6, 26, 4, 8, 13, 1]
sample_df = pd.DataFrame(list(zip(sample1, sample2, sample3)), columns=['sample1', 'sample2', 'sample3'])

pg = sn.PairGrid(sample_df)
pg.map_diag(sn.histplot)
pg.map_lower(sn.regplot)
pg.map_upper(reg_coef)

## Examples using the Iris library<a class="anchor" id="examples_iris"></a>

By calculating the correlation between a dataset and every cell on a grid, spatial patterns can be identified on a map. See the example below correlating the air temperature in South Africa between January and March to global sea surface temperatures. There is a significant positive correlation between the temperature in South Africa and the sea surface temperatures in the eastern Pacific, suggesting a relationship to ENSO.

In [None]:
import iris
import numpy as np
import iris.analysis
import pandas as pd
from scipy.stats import pearsonr
from matplotlib import pyplot as plt
import cartopy.crs as ccrs
import iris.plot as iplt
from matplotlib import colors as clr, cm
import matplotlib as mpl
import cartopy.feature as cfeature
import warnings
warnings.filterwarnings('ignore')


def create_new_cube(cube_to_copy, long_name):
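    '''
    Creates an iris cube with the time dimension collapsed, for storing
    one value per grid cell
    
    Input:
    ------
    * cube_to_copy
        iris cube to copy the grid from
    * long_name
        string to use as the long name of the new cube
    
    Returns:
    --------
    * corr_cube
        copy of cube_to_copy with the time dimension collapsed to its mean
    '''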
    corr_cube = cube_to_copy.copy()
    corr_cube.long_name = long_name
    corr_cube.standard_name = None
    corr_cube.units = ''
    corr_cube = corr_cube.collapsed('time', iris.analysis.MEAN)
    return corr_cube

def correlate(df_data1, cube_data2, fieldname):
    '''
    Calculates the correlations between a DataFrame and a cube
    
    Input:
    -----
    * df_data1
        pandas DataFrame of the Jan-Mar air temperature over South Africa
    * cube_data2
        iris cube of sea surface temperatures
    * fieldname
        string of the name of the column to use in the DataFrame
    
    Returns:
    --------
    * corr_cube
        iris cube of the calculated correlation coefficients
    * corr_pvals_cube
        iris cube of the p-values of the calculated correlation coefficients
    '''
    corr_cube = create_new_cube(cube_data2, 'Pearson Correlation Coefficient')
    pval_cube = create_new_cube(cube_data2, 'p-values')
    num_lats = len(cube_data2.coord('latitude').points)
    num_lons = len(cube_data2.coord('longitude').points)
    
    for i,lat in enumerate(cube_data2.coord('latitude').points):
        print('lat {} of {}'.format(i+1, num_lats))
        for j,lon in enumerate(cube_data2.coord('longitude').points):
            cube_lat_lon = cube_data2.extract(iris.Constraint(latitude=lat, longitude=lon))
            corrcoefs, p_vals = pearsonr(df_data1[fieldname], cube_lat_lon.data.data)
            corr_cube.data[i,j] = corrcoefs
            pval_cube.data[i,j] = p_vals
    return corr_cube, pval_cube

df_data1 = pd.read_csv('data/South_Africa_JFM_temperature.csv')
cube_data2 = iris.load_cube('data/OBS-ERA5_5deg_sst_jfm_1979_2018_low_res_5deg.nc')

corr_cube, corr_pvals_cube = correlate(df_data1, cube_data2, 'Temperature_anomaly')

fig = plt.figure()
pcf = iplt.contourf(corr_cube, [-1, -0.8, -0.6, -0.4, -0.2, 0, 0.2, 0.4, 0.6, 0.8, 1], cmap='brewer_RdBu_11')
iplt.contourf(corr_pvals_cube, colors='none', levels=[0.0, 0.05], hatches=['xx']) 
plt.gca().add_feature(cfeature.LAND, zorder=10, edgecolor='k', facecolor='white')
cax = fig.add_axes([0.27, 0.08, 0.5, 0.04])
cb = fig.colorbar(pcf, cax=cax, extend='neither', orientation='horizontal')
cb.set_label('Pearson correlation coefficient', size=15)
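The `correlate` function above calls `pearsonr` once per grid cell, which can be slow on larger grids. As a possible alternative, the sketch below computes the coefficient for every cell at once with NumPy. It uses synthetic random arrays as stand-ins for the temperature series and the sea surface temperature field (the shapes, names and values are assumptions for illustration, not the course data), and it returns coefficients only, not p-values.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(seed=0)

# Synthetic stand-ins: 40 time steps on a 4 x 6 latitude-longitude grid
n_time, n_lat, n_lon = 40, 4, 6
series = rng.normal(size=n_time)
field = rng.normal(size=(n_time, n_lat, n_lon))

# Make one grid cell depend on the series so that a strong correlation appears there
field[:, 1, 2] += 2 * series

# Anomalies relative to the time mean
series_anom = series - series.mean()
field_anom = field - field.mean(axis=0)

# Pearson's r at every grid cell: sum over time of the products of the anomalies,
# normalised by the root of the product of the summed squared anomalies
numerator = np.tensordot(series_anom, field_anom, axes=(0, 0))
denominator = np.sqrt(np.sum(series_anom**2) * np.sum(field_anom**2, axis=0))
corr_map = numerator / denominator

# Spot-check one cell against SciPy
check_r, _ = pearsonr(series, field[:, 1, 2])
print(f'Vectorised r = {corr_map[1, 2]:.4f}, SciPy r = {check_r:.4f}')
```

With real data, the resulting array could be written into a cube such as `corr_cube` in place of the per-cell assignments, provided the time dimension comes first and there are no masked or missing values.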

## Exercise 1<a class="anchor" id="exercise_1"></a>

Using any Python method, calculate the correlation matrix between the temperature and precipitation in South Africa and the maize yield, imports, exports and production. 
- Which two variables have the strongest positive correlation? 
- Which two variables have the strongest negative correlation? 
- Which two variables have the weakest correlation?


In [None]:
import pandas as pd

df_data = pd.read_csv('data/South_Africa_JFM_climate_crop_data.csv')

<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
**Solution**

<font color='red'>**NOTE**</font>: Your methods can include any Python library

- Maize yield and production have the strongest positive correlation.
- Maize yield and temperature have the strongest negative correlation.
- Exports and precipitation have the weakest correlation.
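One possible way to pick out the strongest and weakest pairs from a correlation matrix is sketched below. It uses a made-up DataFrame with hypothetical column names (`var_a` to `var_d`) rather than the exercise data, so the same steps would need to be applied to `df_data` to reproduce the answers above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)

# Hypothetical data: the column names and values are invented for illustration
base = rng.normal(size=50)
df = pd.DataFrame({
    'var_a': base,
    'var_b': base + rng.normal(scale=0.1, size=50),   # rises with var_a
    'var_c': -base + rng.normal(scale=0.1, size=50),  # falls as var_a rises
    'var_d': rng.normal(size=50),                     # unrelated noise
})

corr = df.corr()

# Keep each pair once by masking out the diagonal and the lower triangle
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Reshape to a Series indexed by (variable, variable) and drop the masked entries
pairs = upper.stack().dropna()

strongest_positive = pairs.idxmax()
strongest_negative = pairs.idxmin()
weakest = pairs.abs().idxmin()

print(f'Strongest positive correlation: {strongest_positive}')
print(f'Strongest negative correlation: {strongest_negative}')
print(f'Weakest correlation: {weakest}')
```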