
<p align="center">
    <img src="https://github.com/GeostatsGuy/GeostatsPy/blob/master/TCG_color_logo.png?raw=true" width="220" height="240" />

</p>

## Interactive Correlation Coefficient Limitations Demonstration


### Michael Pyrcz, Associate Professor, University of Texas at Austin 

##### [Twitter](https://twitter.com/geostatsguy) | [GitHub](https://github.com/GeostatsGuy) | [Website](http://michaelpyrcz.com) | [GoogleScholar](https://scholar.google.com/citations?user=QVZ20eQAAAAJ&hl=en&oi=ao) | [Book](https://www.amazon.com/Geostatistical-Reservoir-Modeling-Michael-Pyrcz/dp/0199731446) | [YouTube](https://www.youtube.com/channel/UCLqEr-xV-ceHdXXXrTId5ig)  | [LinkedIn](https://www.linkedin.com/in/michael-pyrcz-61a648a1)


### The Correlation Coefficient Workflow

Here's a simple, interactive workflow to help visualize the limitations of the correlation coefficient through the addition of a single outlier. 

* students have asked me about the impact of outliers on Pearson product-moment and Spearman's rank correlation coefficients

* to help them understand, I have coded these simple examples

We will keep it quite simple:

* we assume the two features are Gaussian distributed, univariate and bivariate

#### Bivariate Analysis

Understand and quantify the relationship between two features

* how can we use this relationship?

What would be the impact if we ignored this relationship between features and simply modeled them independently?

* no relationship beyond constraints at data locations
* independent away from data
* nonphysical results, unrealistic uncertainty models

#### Bivariate Statistics

Pearson’s Product‐Moment Correlation Coefficient

* Provides a measure of the degree of linear relationship.
* We refer to it as the 'correlation coefficient'

Let's review the sample variance of variable $x$. Of course, I'm truncating our notation, as $x$ is a set of samples at locations in our modeling space, $x(\bf{u_\alpha}), \, \forall \, \alpha = 1, \dots, n$.

\begin{equation}
\sigma^2_{x}  = \frac{\sum_{i=1}^{n} (x_i - \overline{x})^2}{(n-1)}
\end{equation}

We can expand the squared term and replace one of the factors with $y$, another variable in addition to $x$.

\begin{equation}
C_{xy}  = \frac{\sum_{i=1}^{n} (x_i - \overline{x})(y_i - \overline{y})}{(n-1)}
\end{equation}

We now have a measure that represents the manner in which variables $x$ and $y$ co-vary or vary together. We can standardize the covariance by the product of the standard deviations of $x$ and $y$ to calculate the correlation coefficient.

\begin{equation}
\rho_{xy}  = \frac{\sum_{i=1}^{n} (x_i - \overline{x})(y_i - \overline{y})}{(n-1)\sigma_x \sigma_y}, \, -1.0 \le \rho_{xy} \le 1.0
\end{equation}

In summary we can state that the correlation coefficient is related to the covariance as:

\begin{equation}
\rho_{xy}  = \frac{C_{xy}}{\sigma_x \sigma_y}
\end{equation}
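
To make the equations above concrete, here is a minimal sketch that computes the sample covariance and the correlation coefficient by hand and compares the result with NumPy's built-in. The function names `sample_covariance` and `pearson_correlation` and the small dataset are my own illustrations, not part of GeostatsPy.

```python
import numpy as np

def sample_covariance(x, y):
    """Sample covariance with the (n - 1) denominator, matching C_xy above."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)

def pearson_correlation(x, y):
    """Covariance standardized by the product of the sample standard deviations."""
    sigma_x = np.sqrt(sample_covariance(x, x))   # variance is the covariance of x with itself
    sigma_y = np.sqrt(sample_covariance(y, y))
    return sample_covariance(x, y) / (sigma_x * sigma_y)

# small made-up dataset, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

rho_manual = pearson_correlation(x, y)
rho_numpy = np.corrcoef(x, y)[0, 1]              # NumPy's built-in, for comparison
print('manual:', np.round(rho_manual, 4), ' numpy:', np.round(rho_numpy, 4))
```

The two values should agree to floating-point precision, since the $(n-1)$ terms cancel in the ratio.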

Pearson's correlation coefficient is quite sensitive to outliers and departure from linear behavior (in the bivariate sense). We have an alternative known as Spearman's rank correlation coefficient.

\begin{equation}
\rho_{R_x R_y}  = \frac{\sum_{i=1}^{n} (R_{x_i} - \overline{R_x})(R_{y_i} - \overline{R_y})}{(n-1)\sigma_{R_x} \sigma_{R_y}}, \, -1.0 \le \rho_{R_x R_y} \le 1.0
\end{equation}

The rank correlation applies the rank transform to the data prior to calculating the correlation coefficient. To calculate the rank transform, simply replace the data values with their ranks $R_x = 1,\dots,n$, where rank $n$ is assigned to the maximum value and rank $1$ to the minimum value.

\begin{equation}
x_\alpha, \, \forall \alpha = 1,\dots, n, \, | \, x_i \ge x_j \, \forall \, i \gt j 
\end{equation}

\begin{equation}
R_{x_i} = i
\end{equation}
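
As a quick check of the rank transform described above, the following sketch ranks a small made-up dataset with `scipy.stats.rankdata`, computes the Pearson correlation of the ranks, and compares it with `scipy.stats.spearmanr`. The data values are invented for illustration; note that `rankdata` assigns average ranks to ties by default, which is a detail beyond the simple definition given here.

```python
import numpy as np
from scipy import stats

# small made-up dataset with one extreme value in x, for illustration only
x = np.array([3.2, 1.5, 4.8, 2.9, 250.0])
y = np.array([2.0, 1.0, 5.0, 3.0, 4.0])

rank_x = stats.rankdata(x)     # 1 for the minimum, n for the maximum
rank_y = stats.rankdata(y)

rho_rank_manual = np.corrcoef(rank_x, rank_y)[0, 1]   # Pearson correlation of the ranks
rho_rank_scipy = stats.spearmanr(x, y)[0]             # SciPy's rank correlation

print('ranks of x:', rank_x)
print('manual:', np.round(rho_rank_manual, 4), ' scipy:', np.round(rho_rank_scipy, 4))
```

The extreme value of 250.0 simply receives rank 5, so its magnitude no longer matters once the data are ranked.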

The correlation coefficients provide useful metrics to quantify relationships between two variables at a time. We can also consider bivariate scatter plots and matrix scatter plots to visualize multivariate data. In general, current practical subsurface modeling is bivariate, two variables at a time.

#### Objective 

In the PGE 383: Stochastic Subsurface Modeling class I want to provide hands-on experience with building subsurface modeling workflows. Python provides an excellent vehicle to accomplish this. I have coded a package called GeostatsPy with GSLIB: Geostatistical Library (Deutsch and Journel, 1998) functionality that provides basic building blocks for building subsurface modeling workflows. 

The objective is to remove the hurdles of subsurface modeling workflow construction by providing building blocks and sufficient examples. This is not a coding class per se, but we need the ability to 'script' workflows working with numerical methods.    

#### Getting Started

Here are the steps to get set up in Python with the GeostatsPy package:

1. Install Anaconda 3 on your machine (https://www.anaconda.com/download/). 
2. From Anaconda Navigator (within Anaconda3 group), go to the environment tab, click on base (root) green arrow and open a terminal. 
3. In the terminal type: pip install geostatspy. 
4. Open Jupyter and in the top block get started by copy and pasting the code block below from this Jupyter Notebook to start using the geostatspy functionality. 

You will need to copy the data file to your working directory. It is available here:

* Tabular data - sample_data.csv at https://git.io/fh4gm.

There are examples below with these functions. You can go here to see a list of the available functions, https://git.io/fh4eX, along with other example workflows and source code.

#### Load the required libraries

The following code loads the required libraries.

In [1]:
import geostatspy.GSLIB as GSLIB                       # GSLIB utilities, visualization and wrapper
import geostatspy.geostats as geostats                 # GSLIB methods converted to Python

We will also need some standard packages. These should have been installed with Anaconda 3.

In [2]:
%matplotlib inline
import os                                               # to set current working directory 
import sys                                              # suppress output to screen for the interactive dashboard
import io
import numpy as np                                      # arrays and matrix math
import pandas as pd                                     # DataFrames
from scipy import stats
import matplotlib.pyplot as plt                         # plotting
from matplotlib.pyplot import cm                        # color maps
from matplotlib.patches import Ellipse                  # plot an ellipse
import math                                             # sqrt operator
import random                                           # random simulation locations
from copy import copy                                   # copy a colormap
from ipywidgets import interactive                      # widgets and interactivity
from ipywidgets import widgets                            
from ipywidgets import Layout
from ipywidgets import Label
from ipywidgets import VBox, HBox
from scipy.stats import norm                            # Gaussian distribution
import scipy.stats as st                                # statistical methods

If you get a package import error, you may have to first install some of these packages. This can usually be accomplished by opening up a command window on Windows and then typing 'python -m pip install [package-name]'. More assistance is available with the respective package docs.  

#### Interactive Correlation Coefficient

Draw random values from a bivariate Gaussian distribution parameterized by:

* **$\overline{X}_1$, $\overline{X}_2$** - mean of features $X_1$ and $X_2$

* **$\sigma_{X_1}$, $\sigma_{X_2}$** - standard deviation of features $X_1$ and $X_2$

* **$\rho_{X_1,X_2}$** - Pearson product-moment correlation coefficient
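
These parameters map onto a covariance matrix with $\sigma^2_{X_1}$ and $\sigma^2_{X_2}$ on the diagonal and $\rho \, \sigma_{X_1} \sigma_{X_2}$ off the diagonal. Here is a minimal sketch of that mapping; the helper name `bivariate_gaussian_sample` is hypothetical, and it uses NumPy's `default_rng` generator rather than the legacy `np.random.seed` interface used in the dashboard code.

```python
import numpy as np

def bivariate_gaussian_sample(mean1, mean2, sigma1, sigma2, rho, n, seed=73072):
    """Draw n samples from a bivariate Gaussian given means, standard deviations and rho."""
    cov12 = rho * sigma1 * sigma2                      # covariance from the correlation
    cov = np.array([[sigma1**2, cov12],
                    [cov12, sigma2**2]])
    rng = np.random.default_rng(seed)                  # seeded for repeatable draws
    return rng.multivariate_normal([mean1, mean2], cov, size=n)

# means of 10 and variances of 2.0, with an assumed correlation of 0.7
sample = bivariate_gaussian_sample(10.0, 10.0, np.sqrt(2.0), np.sqrt(2.0), 0.7, 5000)
rho_hat = np.corrcoef(sample[:, 0], sample[:, 1])[0, 1]
print('sample correlation:', np.round(rho_hat, 3))
```

With a large number of samples the sample correlation should land close to the specified $\rho$; with few samples expect noticeable scatter around it.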

Now let's set up our dashboard.

In [12]:
import warnings; warnings.simplefilter('ignore')

# dashboard: number of samples, correlation coefficient and outlier location (log10 of coordinates)
style = {'description_width': 'initial'}
l = widgets.Text(value='                                        Correlation Coefficient with an Outlier, Michael Pyrcz, Associate Professor, The University of Texas at Austin',layout=Layout(width='950px', height='30px'))
ndata = widgets.IntSlider(min = 5, max = 1000, value = 50, step = 1, description = r'$n_{samples}$',orientation='horizontal',continuous_update=False,
                          layout=Layout(width='600px', height='40px'))
ndata.style.handle_color = 'gray'

corr = widgets.FloatSlider(min = -1.0, max = 1.0, value = 0, step = 0.1, description = r'$\rho_{x_1,x_2}$',orientation='horizontal',continuous_update=False,
                          layout=Layout(width='600px', height='40px'))

ox = widgets.FloatSlider(min = 1.0, max = 2.8, value = 1, step = 0.1, description = r'$log(x_{n+1})$',orientation='horizontal',continuous_update=False,
                          layout=Layout(width='600px', height='40px'))

oy = widgets.FloatSlider(min = 1.0, max = 2.8, value = 1, step = 0.1, description = r'$log(y_{n+1})$',orientation='horizontal',continuous_update=False,
                          layout=Layout(width='600px', height='40px'))

corr.style.handle_color = 'gray'

uipars = widgets.HBox([ndata,corr,ox,oy],)     

uik = widgets.VBox([l,uipars],)

def f_make(ndata,corr,ox,oy): # function to take parameters, make sample and plot
    ox = 10**ox; oy = 10**oy
    text_trap = io.StringIO()                           # suppress all text function output to dashboard to avoid clutter 
    sys.stdout = text_trap
    cmap = cm.inferno
    np.random.seed(seed = 73072)                        # ensure same results for all runs
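    mean = np.array([10,10])                            # mean of both features
    correl = np.array([[2.0,corr*2.0],[corr*2.0,2.0]],dtype=float) # covariance matrix: variance of 2.0, covariance of corr * 2.0
    sample = np.random.multivariate_normal(mean,correl,size = ndata) # draw ndata bivariate Gaussian samples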
    sample = np.vstack([sample,[ox,oy]])
    slope, intercept, r_value, p_value, std_err = st.linregress(sample[:,0],sample[:,1])
    xmin = 1; ymin = 1; xmax = 1000; ymax = 1000
    x1 = np.array([xmin,xmax])
    x2 = x1*slope + intercept
    
    nbin = max(int(ndata / 10), 2)                      # at least 2 bin edges so the histograms are defined for small ndata
    plt_scatter = plt.subplot2grid((3, 3), (1, 0), rowspan=2, colspan=2)
    plt_x1 = plt.subplot2grid((3, 3), (0, 0), colspan=2,
                               sharex=plt_scatter)
    plt_x2 = plt.subplot2grid((3, 3), (1, 2), rowspan=2,
                               sharey=plt_scatter)    
    
    #plt.plot([0,0],[1.0,1.0],color = 'black')

    
#     plt_scatter.plot(x1,x2,color = 'black',label = r'$X_2 = f(X_1)$')
    plt_scatter.scatter(sample[:ndata,0],sample[:ndata,1],color = 'red',alpha = 0.2,edgecolors='black',label = 'Samples') # rows 0 to ndata-1 are the samples, the last row is the outlier
    plt_scatter.scatter(ox,oy,color = 'blue',alpha = 1.0,marker='s',edgecolors='black',label = 'Outlier')
    plt_scatter.set_xlabel(r'$x_1$')
    plt_scatter.set_ylabel(r'$x_2$')
    plt_scatter.set_xlim([xmin,xmax])
    plt_scatter.set_ylim([ymin,ymax])
    plt_scatter.legend(loc='upper left') 
    plt_scatter.set_xscale('log'); plt_scatter.set_yscale('log')
    
    pearson = stats.pearsonr(sample[:,0],sample[:,1])[0] # Pearson correlation of samples plus outlier, kept separate from the input corr
    plt_scatter.annotate(r'$\rho$ = ' + str(np.round(pearson,3)),(xmin+(xmax-xmin)*0.2,ymax-(ymax-ymin)*0.9985),size=15)
    corrs = stats.spearmanr(sample[:,0],sample[:,1])[0] # Spearman's rank correlation of samples plus outlier
    plt_scatter.annotate(r'$\rho_{r}$ = ' + str(np.round(corrs,3)),(xmin+(xmax-xmin)*0.2,ymax-(ymax-ymin)*0.9995),size=15)
    
    plt_x1.hist(sample[:,0],density = True,color='red',alpha=0.8,edgecolor='black',bins=np.logspace(np.log10(xmin),np.log10(xmax),nbin))
    plt_x1.set_ylim([0.0,0.3])
    plt_x1.set_xlabel(r'$x_1$'); plt_x1.set_ylabel(r'Density')
    plt_x1.set_title(r'Bivariate Gaussian Distributed Data with $\rho =$' + str(round(corr,3)) + ' & 1 Outlier')
    
    plt_x2.hist(sample[:,1],orientation='horizontal',density = True,color='red',alpha=0.8,edgecolor='black',bins=np.logspace(np.log10(ymin),np.log10(ymax),nbin))
    plt_x2.set_xlim([0.0,0.3])
    plt_x2.set_ylabel(r'$x_2$'); plt_x2.set_xlabel(r'Density')
    plt_scatter.set_ylabel(r'$x_2$')
    
    plt.subplots_adjust(left=0.0, bottom=0.0, right=1.5, top=1.7, wspace=0.3, hspace=0.3)
    plt.show()
    
# connect the function to make the samples and plot to the widgets    
interactive_plot = widgets.interactive_output(f_make, {'ndata':ndata,'corr':corr,'ox':ox,'oy':oy})
#interactive_plot.clear_output(wait = True)               # reduce flickering by delaying plot updating

### Interactive Correlation Coefficient Demonstration

* select the number of data, the correlation coefficient and the location of an outlier, then compare the Pearson product-moment and Spearman's rank correlation coefficients.

#### Michael Pyrcz, Associate Professor, University of Texas at Austin 

##### [Twitter](https://twitter.com/geostatsguy) | [GitHub](https://github.com/GeostatsGuy) | [Website](http://michaelpyrcz.com) | [GoogleScholar](https://scholar.google.com/citations?user=QVZ20eQAAAAJ&hl=en&oi=ao) | [Book](https://www.amazon.com/Geostatistical-Reservoir-Modeling-Michael-Pyrcz/dp/0199731446) | [YouTube](https://www.youtube.com/channel/UCLqEr-xV-ceHdXXXrTId5ig)  | [LinkedIn](https://www.linkedin.com/in/michael-pyrcz-61a648a1) | [GeostatsPy](https://github.com/GeostatsGuy/GeostatsPy)

### The Inputs

Select the number of samples and the Pearson product-moment correlation coefficient:

* **$n_{samples}$**: number of samples, **$\rho_{x_1,x_2}$**: the Pearson product-moment correlation
* **$\log(x_{n+1})$**, **$\log(y_{n+1})$**: location of a single outlier, given as the base-10 logarithm of its coordinates

In [13]:
display(uik, interactive_plot)                            # display the interactive plot


#### Comments

This was an interactive demonstration of the impact of an outlier on the Pearson product-moment and Spearman's rank correlation coefficients. I hope it provides students an opportunity to play with data analytics, geostatistics and machine learning for experiential learning.
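
For readers without the interactive widgets, here is a minimal static sketch of the same experiment: 50 uncorrelated Gaussian samples, with and without a single outlier placed at the maximum slider position ($10^{2.8}$ in both features). The sample size and outlier location are my choices for illustration, and the draws come from NumPy's `default_rng`, so the values will not match the dashboard exactly.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(73072)
cov = np.array([[2.0, 0.0], [0.0, 2.0]])               # uncorrelated features, variance of 2.0
sample = rng.multivariate_normal([10.0, 10.0], cov, size=50)

outlier = np.array([[10**2.8, 10**2.8]])               # a single extreme point in both features
with_outlier = np.vstack([sample, outlier])

results = {}
for label, data in [('without outlier', sample), ('with outlier', with_outlier)]:
    pearson = stats.pearsonr(data[:, 0], data[:, 1])[0]
    spearman = stats.spearmanr(data[:, 0], data[:, 1])[0]
    results[label] = (pearson, spearman)
    print(label, '| Pearson:', np.round(pearson, 3), '| Spearman:', np.round(spearman, 3))
```

The Pearson coefficient jumps to nearly 1.0 with the outlier included, while the rank correlation shifts only slightly because the outlier contributes just one more rank.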
  
#### The Author:

### Michael Pyrcz, Associate Professor, University of Texas at Austin 
*Novel Data Analytics, Geostatistics and Machine Learning Subsurface Solutions*

With over 17 years of experience in subsurface consulting, research and development, Michael has returned to academia driven by his passion for teaching and enthusiasm for enhancing engineers' and geoscientists' impact in subsurface resource development. 

For more about Michael check out these links:

#### [Twitter](https://twitter.com/geostatsguy) | [GitHub](https://github.com/GeostatsGuy) | [Website](http://michaelpyrcz.com) | [GoogleScholar](https://scholar.google.com/citations?user=QVZ20eQAAAAJ&hl=en&oi=ao) | [Book](https://www.amazon.com/Geostatistical-Reservoir-Modeling-Michael-Pyrcz/dp/0199731446) | [YouTube](https://www.youtube.com/channel/UCLqEr-xV-ceHdXXXrTId5ig)  | [LinkedIn](https://www.linkedin.com/in/michael-pyrcz-61a648a1)

#### Want to Work Together?

I hope this content is helpful to those that want to learn more about subsurface modeling, data analytics and machine learning. Students and working professionals are welcome to participate.

* Want to invite me to visit your company for training, mentoring, project review, workflow design and / or consulting? I'd be happy to drop by and work with you! 

* Interested in partnering, supporting my graduate student research or my Subsurface Data Analytics and Machine Learning consortium (co-PIs including Profs. Foster, Torres-Verdin and van Oort)? My research combines data analytics, stochastic modeling and machine learning theory with practice to develop novel methods and workflows to add value. We are solving challenging subsurface problems!

* I can be reached at mpyrcz@austin.utexas.edu.

I'm always happy to discuss,

*Michael*

Michael Pyrcz, Ph.D., P.Eng. Associate Professor The Hildebrand Department of Petroleum and Geosystems Engineering, Bureau of Economic Geology, The Jackson School of Geosciences, The University of Texas at Austin

  