# Geostatistics

## 6. Variograms

This lecture focuses on theoretical variogram functions and how to fit them to experimental data.

* empirical variograms
* theoretical functions
* variogram parameters
* fitting

In [1]:
import numpy as np
import pandas as pd
import skgstat as skg

from bokeh.io import output_notebook
from bokeh.plotting import figure, show

output_notebook()

## 6.1 From Covariance to Semi-variance

It is possible to work with spatial covariance functions in geostatistics, as we did in the last lecture.
In this lecture, we want to focus on the most common tool, the *variogram*, which is very closely related to the covariance function we built in the last lecture.

The *variogram* comes with a few advantages, which are listed at the end of this section.

*In the last lecture, we implemented each and every calculation ourselves. This was important to understand what is going on. In this lecture, we will use third-party Python modules to save some time. New content will again be introduced with examples of how to implement the shown problems as algorithms. These can then be directly implemented in languages other than Python.*

First, we need to go back to the Covariance function we used before:

For a set of *observations* $x$ and another set of observations $x_h$ at a *spatial* lag $h$, their covariance was defined as: 

$$ Cov(x,x_h) = \frac{\sum_{i=1}^N [(x_i - \mu_x)*(x_{h, i} - \mu_{x_h})]}{N} $$

Where $N$ is the length of $x$ (and of $x_h$) and $\mu_x$, $\mu_{x_h}$ are their respective expected values. That means that when calculating the covariance at a specific lag, the two $\mu$ are constant for all elements in the sum.
We can expand the product in the equation above to:

$$ Cov(x,x_h) = \frac{\sum_{i=1}^N (x_i*x_{h,i})}{N} - \mu_x * \mu_{x_h} = Cov(h)$$
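The equivalence of the two forms can be checked numerically. The following is only a minimal sketch in plain Python; the function names and the paired sample values are made up for illustration.

```python
def covariance_direct(x, x_h):
    """Mean of the products of the deviations from the respective means."""
    n = len(x)
    mu_x = sum(x) / n
    mu_xh = sum(x_h) / n
    return sum((a - mu_x) * (b - mu_xh) for a, b in zip(x, x_h)) / n


def covariance_expanded(x, x_h):
    """Mean of the products minus the product of the means."""
    n = len(x)
    mu_x = sum(x) / n
    mu_xh = sum(x_h) / n
    return sum(a * b for a, b in zip(x, x_h)) / n - mu_x * mu_xh


# hypothetical paired observations: x[i] and x_h[i] are separated by the lag h
x = [1.0, 2.0, 4.0, 3.0]
x_h = [2.0, 1.0, 5.0, 4.0]

print(covariance_direct(x, x_h), covariance_expanded(x, x_h))
```

Both functions should return the same value up to floating point rounding.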

If we want a good estimate for the population from the covariance as defined above, we have to make sure that the three expected values (the two $\mu$ and the mean of the products) are actually a good measure for the sample's distribution.

That's the case for a *normal distribution*, as all expected values are just the arithmetic mean values. Thus, we require that:

* $x$ and $x_h$ are normally distributed

* Recall that $h$ is a spatial lag, not a location. Thus, $x_h$ should be normally distributed at any given *lag*.

This is called **second order stationarity**, as the mean and the variance of the **variable** should be stationary.

But we run into more assumptions/limitations here. Think of the Covariance function at the lag 0

$$ Cov(0) = Cov(x,x) := Var(x) $$

Thus, we require the *empirical* covariance function to fit to the Variance at lag 0, which is often difficult to reach due to the limited amount of point pairs at very close distance.

Further, we are looking for a special lag $r$ at which samples become statistically independent. Theoretical covariance model functions will not map values $Cov < 0$, but empirical values might in turn be negative.

Thus, we need a tool that can, in principle, *model* dissimilarities that increase without bound with distance.

That function is called a *semi-variogram*. It does not use the covariance but the semi-variance, and in the case of stationarity it is defined as:

$$ \gamma(h) = \sigma^2 - Cov(h) $$

The semi-variance can be calculated like:

$$ \gamma(h) = \frac{1}{2N} \sum_{i=1}^N (x_i - x_{h,i})^2 $$


* $\gamma$ does not involve the $\mu$ anymore
* We calculate a measure for the pair-wise deviations that becomes **larger** for bigger differences
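The two expressions for $\gamma(h)$ given above are connected by expanding the squared difference. The following short derivation is a sketch that assumes $x$ and $x_h$ share the same mean $\mu$ and the same variance $\sigma^2$ (second order stationarity):

$$ \gamma(h) = \frac{1}{2N} \sum_{i=1}^N (x_i - x_{h,i})^2 = \frac{1}{2} \left( \frac{\sum_{i=1}^N x_i^2}{N} + \frac{\sum_{i=1}^N x_{h,i}^2}{N} \right) - \frac{\sum_{i=1}^N (x_i*x_{h,i})}{N} $$

With the assumption above and the expanded covariance equation, the individual terms are:

$$ \frac{\sum_{i=1}^N x_i^2}{N} = \frac{\sum_{i=1}^N x_{h,i}^2}{N} = \sigma^2 + \mu^2 \qquad \frac{\sum_{i=1}^N (x_i*x_{h,i})}{N} = Cov(h) + \mu^2 $$

Inserting them, the $\mu^2$ terms cancel:

$$ \gamma(h) = (\sigma^2 + \mu^2) - (Cov(h) + \mu^2) = \sigma^2 - Cov(h) $$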

The semi-variance has a few advantages over using the covariance function:

* We can relax the assumptions of stationarity to requiring only that the *deviations* depend on nothing but the lag $h$
* $\gamma(0)$ can be `0` or larger, which fits better to real world data that is often prone to measurement uncertainty or small scale variations
* $\gamma$ may or may not increase unbounded with $h$. From this we can define **unbounded** functions in cases where the covariance function is not defined. 
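That $\gamma$ does not involve $\mu$ can be illustrated with a few lines of plain Python. This is a minimal sketch with made-up paired observations: adding the same constant to all values changes the means, but leaves the semi-variance untouched.

```python
def semivariance(x, x_h):
    """Half the mean squared difference of the paired observations."""
    n = len(x)
    return sum((a - b) ** 2 for a, b in zip(x, x_h)) / (2 * n)


# hypothetical paired observations at one lag
x = [1.0, 2.0, 4.0, 3.0]
x_h = [2.0, 1.0, 5.0, 4.0]

gamma = semivariance(x, x_h)

# adding the same constant to every observation changes the means,
# but not the pairwise differences
shifted = semivariance([v + 100.0 for v in x], [v + 100.0 for v in x_h])

print(gamma, shifted)
```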

## 6.2 Empirical variograms

One of the (Python) modules for geostatistics is [scikit-gstat](https://github.com/mmaelicke/scikit-gstat). To actually calculate a variogram, we will use this module to obtain all the data we used in the last lecture with less effort.

For exercises, you can of course also reuse the code from the last lecture. 

In [2]:
data = pd.read_csv('./data/sample_data.txt', sep=r'\s+')  # raw string avoids an invalid escape sequence warning
data.head()

|  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.203508 | -0.164411 | -0.696673 | -0.555673 | 1.286489 | -0.916081 | 0.572388 | -0.912941 | -2.571167 | 0.763894 | ... | -0.549597 | 0.292536 | 0.793382 | -0.216959 | 1.840647 | -0.582903 | 0.010175 | -0.474006 | -0.331282 | -0.230938 |
| 1 | -0.585733 | 0.035986 | -0.82284 | 0.105027 | 0.806096 | -1.280912 | 0.808606 | -0.971974 | -2.160372 | 0.733315 | ... | -0.593188 | 1.084725 | 0.558109 | -0.216897 | 1.355106 | -0.623354 | -0.065394 | -0.573369 | -0.679012 | 0.065802 |
| 2 | -0.346062 | -0.188272 | 0.217902 | 0.357807 | 0.985086 | -1.709977 | -0.167685 | -0.921627 | -2.44877 | 1.115545 | ... | -0.356511 | 0.828866 | 1.440411 | -0.151955 | 0.58449 | -0.990339 | -0.759851 | -0.300697 | -0.474233 | 0.073369 |
| 3 | -0.478679 | 0.365037 | -0.416354 | 0.334278 | 1.457831 | -0.882513 | 0.056564 | -0.64645 | -2.482363 | 0.163104 | ... | -0.186352 | 1.360957 | 0.375749 | -0.159243 | 1.155086 | -0.250776 | -0.036381 | 0.079864 | 0.118114 | -0.364431 |
| 4 | -0.736619 | -0.028943 | -0.418246 | 0.02998 | 0.101014 | -0.959323 | 0.189186 | -0.590574 | -2.206486 | 1.310469 | ... | -0.117542 | 0.969032 | 0.379265 | 0.330215 | 1.372918 | -0.732134 | -0.520751 | -0.340854 | -0.259725 | -0.327924 |


In [3]:
coords = pd.read_csv('./data/sample_positions.txt', sep=r'\s+', header=None)  # raw string for the regex separator
coords.columns = ['x', 'y']
coords.head()

|  | x | y |
|---|---|---|
| 0 | 22 | 78 |
| 1 | 3 | 73 |
| 2 | 12 | 85 |
| 3 | 9 | 69 |
| 4 | 78 | 43 |


In [4]:
sample = coords.copy()
sample['z'] = data.loc[0,:].T.values
sample.head()

|  | x | y | z |
|---|---|---|---|
| 0 | 22 | 78 | -0.203508 |
| 1 | 3 | 73 | -0.164411 |
| 2 | 12 | 85 | -0.696673 |
| 3 | 9 | 69 | -0.555673 |
| 4 | 78 | 43 | 1.286489 |


The main advantage of using the `Variogram` class in scikit-gstat is that it makes all the intermediate calculation steps available to us.

In [5]:
# using the same settings as in the last lecture
V = skg.Variogram(coords, sample.z, maxlag='median', n_lags=6, normalize=False)

The 1D distance array is available as `V.distance`, the 2D distance matrix as `V.distance_matrix`. The bin edges are available as `V.bins` and the grouping array (1D version) is returned by `V.lag_groups()`.

In [6]:
print(V.bins.round(2))
print(V.distance[:10].round(2))
print(V.lag_groups()[:10])

[ 8.64 17.29 25.93 34.57 43.21 51.86]
[19.65 12.21 15.81 66.04 74.46 22.2  62.1  77.52 35.   30.41]
[ 2  1  1 -1 -1  2 -1 -1  4  3]
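To see where these numbers come from, the first few of them can be reproduced by hand. The sketch below is not the scikit-gstat implementation. It assumes equally wide lag classes, takes the maximum lag of `51.86` from the last printed bin edge, and assumes that a distance lying exactly on an edge belongs to the lower bin. Only the first five sample positions are used, and the helper `lag_group` is a made-up name.

```python
import math
from itertools import combinations

# first five sample positions, copied from the coords.head() output
coords = [(22, 78), (3, 73), (12, 85), (9, 69), (78, 43)]

# settings read off the printed bin edges: 6 lag classes up to roughly 51.86
maxlag = 51.86
n_lags = 6

# upper bin edges of equally wide lag classes
width = maxlag / n_lags
edges = [width * (i + 1) for i in range(n_lags)]

# condensed (1D) distances: every pair (i, j) with i < j, row by row
distances = [math.dist(p, q) for p, q in combinations(coords, 2)]


def lag_group(d, edges):
    """Index of the first bin whose upper edge is >= d, or -1 beyond the last edge."""
    for i, edge in enumerate(edges):
        if d <= edge:
            return i
    return -1


groups = [lag_group(d, edges) for d in distances]

print([round(d, 2) for d in distances[:4]])
print(groups[:4])
```

The first four distances and groups are the pairs formed with the first point and agree with the first four entries printed above. A group of `-1` marks a point pair that is further apart than the maximum lag.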


The values are also available, as well as the pairwise differences. The class again makes the 1D and 2D version available, but there is a small inconsistency in the naming. The actual observations are available as `V.values`. The pairwise differences are at `V._diff` as 1D and `V.value_matrix` as 2D.

In [7]:
print(V.value_matrix[:10, :10].round(2))
print('----------------------------')
print(V._diff[:10].round(2))

[[0.   0.04 0.49 0.35 1.49 0.71 0.78 0.71 2.37 0.97]
 [0.04 0.   0.53 0.39 1.45 0.75 0.74 0.75 2.41 0.93]
 [0.49 0.53 0.   0.14 1.98 0.22 1.27 0.22 1.87 1.46]
 [0.35 0.39 0.14 0.   1.84 0.36 1.13 0.36 2.02 1.32]
 [1.49 1.45 1.98 1.84 0.   2.2  0.71 2.2  3.86 0.52]
 [0.71 0.75 0.22 0.36 2.2  0.   1.49 0.   1.66 1.68]
 [0.78 0.74 1.27 1.13 0.71 1.49 0.   1.49 3.14 0.19]
 [0.71 0.75 0.22 0.36 2.2  0.   1.49 0.   1.66 1.68]
 [2.37 2.41 1.87 2.02 3.86 1.66 3.14 1.66 0.   3.34]
 [0.97 0.93 1.46 1.32 0.52 1.68 0.19 1.68 3.34 0.  ]]
----------------------------
[0.04 0.49 0.35 1.49 0.71 0.78 0.71 2.37 0.97 0.05]
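The pairwise differences are absolute differences of the observations. A minimal sketch in plain Python, using only the first five observations from the `sample.head()` output (this is not the library code):

```python
from itertools import combinations

# first five observations, copied from the sample.head() output
z = [-0.203508, -0.164411, -0.696673, -0.555673, 1.286489]

# condensed (1D) absolute differences, same pair order as the 1D distances
diffs = [abs(a - b) for a, b in combinations(z, 2)]

# full symmetric (2D) matrix with zeros on the diagonal
matrix = [[abs(a - b) for b in z] for a in z]

print([round(d, 2) for d in diffs[:4]])
print([round(v, 2) for v in matrix[0]])
```

The rounded values agree with the first entries of the 1D array and with the first five entries of the first matrix row printed above.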


Now we are all set to calculate the semi-variances:

In [8]:
k = 6
vario = []

for _k in range(k):
    squared_differences = []
    
    for idx, group in enumerate(V.lag_groups()):
        if group == _k:  # right bin
            squared_differences.append(V._diff[idx]**2)
    
    # gamma is sum divided by 2 * length
    vario.append(sum(squared_differences) / (2*len(squared_differences)))

In [9]:
variogram = figure(
    title='Empirical Variogram', width=700, height=400, x_axis_label='lag [m]', y_axis_label='Semi-Variance',
    tooltips=[('Lag', '@x'), ('Semi-variance', '@y')], tools=['hover']
)
variogram.circle(V.bins, vario, size=8, line_color='#6600CC', fill_color='#6666CC')

In [10]:
show(variogram)

* The semi-variance is increasing linearly. 

* While this is a valid empirical variogram, it is absolutely possible that we just cut off the point pair formation too early when setting the maximum lag to `median`, which is roughly at 50 meters

* The samples were taken from a 100x100 area, so we can go to `100` as a meaningful maximum lag distance setting.

* To resolve the larger distance, we should also increase the number of bins

For this to happen, we need to go back and re-define the bins, then re-index the lag class grouping and iterate over all groups to collect all pair-wise differences in the new lags bins. Then, `vario` can be re-calculated.
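The manual route could look roughly like the following sketch. It is written in plain Python with equally wide lag classes; `manual_variogram` is a hypothetical helper, and the four points on a line with their values are made up so that the result is easy to verify by hand.

```python
import math
from itertools import combinations


def manual_variogram(coords, values, maxlag, n_lags):
    """Return (upper bin edges, semi-variances) for equally wide lag classes."""
    width = maxlag / n_lags
    edges = [width * (i + 1) for i in range(n_lags)]

    # collect the squared differences of all point pairs, one list per lag class
    squared = [[] for _ in range(n_lags)]
    for (p, zp), (q, zq) in combinations(list(zip(coords, values)), 2):
        d = math.dist(p, q)
        for k, edge in enumerate(edges):
            if d <= edge:  # first bin whose upper edge is not exceeded
                squared[k].append((zp - zq) ** 2)
                break
        # pairs beyond the last edge fall through and are not used

    # gamma is the sum divided by 2 * length; empty bins yield NaN
    gamma = [sum(s) / (2 * len(s)) if s else float("nan") for s in squared]
    return edges, gamma


# hypothetical observations along a line
coords = [(0, 0), (1, 0), (2, 0), (3, 0)]
values = [1.0, 2.0, 4.0, 3.0]

edges, gamma = manual_variogram(coords, values, maxlag=3, n_lags=3)
print(edges, gamma)

# changing the settings means running the whole calculation again
edges2, gamma2 = manual_variogram(coords, values, maxlag=2, n_lags=2)
print(edges2, gamma2)
```

Every change of the maximum lag or the number of lag classes requires a full re-run of binning, grouping and averaging.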

Or, we let the scikit-gstat `Variogram` class demonstrate when object-oriented programming is a good idea:

In [11]:
V.maxlag = 100
V.n_lags = 8

That's it.

The resulting empirical, or **experimental**, variogram can be accessed from the property of the same name:

In [12]:
print(V.experimental.round(2))

[0.07 0.26 0.53 0.92 0.95 0.79 0.71 0.76]


In [13]:
variogram = figure(
    title='Empirical Variogram', width=700, height=400, x_axis_label='lag [m]', y_axis_label='Semi-Variance',
    tooltips=[('Lag', '@x'), ('Semi-variance', '@y')], tools=['hover']
)
variogram.circle(V.bins, V.experimental, size=8, line_color='#6600CC', fill_color='#6666CC')

In [14]:
show(variogram)

This way, you can check hundreds of fine-tuned values with ease. But it is necessary to understand what is happening under the hood.

## 6.3 Theoretical variogram models

Empirical variograms already tell us a lot about spatial structure and an apparent spatial dependency in the data set. Now, we want to describe it more systematically through a mathematical function.

That has a lot of advantages including:

* we can define models of known mathematical properties

* one property is that the model is monotonically increasing, which is a prerequisite to use it for interpolation

* we can compare models in a systematic manner

* Most theoretical variogram models can be described by only three parameters:

![](https://gisgeography.com/wp-content/uploads/2016/10/Variogram-Nugget-Range-Sill.png)
<div style="size: 8px">&copy; gisgeography.com  - source: https://gisgeography.com/wp-content/uploads/2016/10/Variogram-Nugget-Range-Sill-425x279.png</div>

* **Sill** is the total variance in the dataset that the samples are approaching with distance.
* **Range**, which I like to call **effective range**, is the *distance* at which the sill is reached. For some models, the distance at which 95% of the sill is reached is used instead, because the model approaches the sill asymptotically but never reaches it.
* **Nugget** is the y-axis intercept. That's the (semi)variance at a lag distance of `0`

A very helpful diagnostic tool, once the parameters above have been calculated, is to look at the **nugget/sill** ratio. 

The nugget is the share of the total variance that you will not be able to model spatially, no matter what. Therefore, using this ratio can give you an idea of how much variance in the *sample* the **model** can explain. Keep in mind that this is a statistical model of the sample, not a feature of the population.
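As a small illustration, the ratio can be computed like this. The sketch assumes that `sill` is the total variance including the nugget, as in the parameter list above; the function name and the parameter values are made up.

```python
def nugget_sill_ratio(nugget, sill):
    """Share of the total variance (sill) that is not spatially structured."""
    if sill <= 0:
        raise ValueError("sill must be positive")
    return nugget / sill


# hypothetical parameters of a fitted model
nugget = 0.1
sill = 0.8

ratio = nugget_sill_ratio(nugget, sill)
print(f"nugget/sill ratio: {ratio:.3f}")
print(f"spatially structured share: {1 - ratio:.3f}")
```

A ratio close to `0` means that almost all of the sample variance can be described by the spatial model, while a ratio close to `1` means that hardly any spatial structure is captured.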

### 6.3.1 spherical model

The most used geostatistical model is the so-called **spherical model**. It is defined as:

$$ \gamma(h) = b+ C_0 * \left(\frac{3}{2}*\frac{h}{a} - \frac{1}{2}*\left(\frac{h}{a}\right)^3 \right) $$

if $h \leq a $ and

$$ \gamma(h) = b + C_0$$ 

otherwise.
Here, $C_0$ is the sill, $b$ is the nugget and $a$ is the variogram parameter *range*, which in general is **not** the effective range $r$. For the spherical model, however, $a := r$.

In [15]:
def spherical(h, r, C0, b):
    a = 1 * r
    if h <= a:
        return b + C0 * (1.5*(h / a) - 0.5 * (h / a)**3)
    else:
        return b + C0
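A few properties of the model can be checked directly. The sketch below repeats the function from the cell above only to stay self-contained; the parameter values are arbitrary examples.

```python
def spherical(h, r, C0, b):
    # same definition as in the cell above, repeated to keep the sketch self-contained
    a = 1 * r
    if h <= a:
        return b + C0 * (1.5 * (h / a) - 0.5 * (h / a) ** 3)
    return b + C0


r, C0, b = 45, 13, 2  # hypothetical parameters
lags = list(range(0, 101, 5))
gamma = [spherical(h, r, C0, b) for h in lags]

at_zero = spherical(0, r, C0, b)    # value at lag 0: the nugget b
at_range = spherical(r, r, C0, b)   # value at the range: b + C0
beyond = spherical(100, r, C0, b)   # constant beyond the range

# the model never decreases with increasing lag
monotonic = all(g1 <= g2 for g1, g2 in zip(gamma, gamma[1:]))

print(at_zero, at_range, beyond, monotonic)
```

The model starts at the nugget, increases monotonically and stays constant at $b + C_0$ once the range is reached.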

In [17]:
x = list(range(100))
y = [spherical(_, 45, 13, 2) for _ in x]

In [19]:
sph_model = figure(title='spherical model', width=700, height=400)
sph_model.line(x,y, color='green')
show(sph_model)

In [20]:
variogram