# Geostatistics

## 6. Excursus: IDW

This is a short excursus on IDW (**I**nverse **D**istance **W**eighting), an interpolation technique that is closely related to geostatistics. Some would even count it as a geostatistical method, although there is nothing statistical about it.

In [27]:
import pandas as pd
import numpy as np
from scipy.spatial.distance import pdist, squareform

from bokeh.io import output_notebook
from bokeh.plotting import figure, show, ColumnDataSource
from bokeh.models import ColorBar, LinearColorMapper

output_notebook()

### 6.1 Theory

The main idea of IDW is simple: estimate the value of a variable at an unobserved location as a weighted mean of the values observed at nearby locations. Each observation is weighted by the *inverse* of its distance to the target location, so the closer an observation is, the more weight it gets.

You have an observation $Z(s_1)$ for location $s_1$ and want to calculate the weight $\lambda(s_0, s_1)$ for an unobserved location $s_0$:

$$ \lambda(s_0, s_1) = \frac{1}{d(s_0, s_1)}$$

where $d$ is the distance as calculated in the last lecture.

The estimation at the unobserved location $s_0$ is then:

$$ Z^*(s_0) = \sum_{i = 1}^N \lambda(s_0, s_i) \cdot Z(s_i)$$

At the same time, we want the estimator to reproduce the observed values at the known locations. To achieve this, we can require:

$$ Z^*(s_i) = Z(s_i) $$

which in turn yields the normalized weights $\lambda^*$:

$$ \lambda^*(s_0, s_i) = \frac{\lambda(s_0, s_i)}{\sum_{j=1}^N \lambda(s_0, s_j)} $$

which combines to the IDW formula:

$$ Z^* (s_0) = \frac{\sum_{i=1}^N \frac{Z(s_i)}{d(s_0, s_i)}} {\sum_{i=1}^N \frac{1}{d(s_0, s_i)}} $$
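
As a small worked example with made-up numbers (not taken from the lecture data), assume three observations $Z(s_1) = 10$, $Z(s_2) = 20$ and $Z(s_3) = 30$ at distances $1$, $2$ and $4$ from $s_0$. The raw weights are $1$, $0.5$ and $0.25$, which sum to $1.75$, so the normalized weights are $4/7$, $2/7$ and $1/7$:

$$ Z^*(s_0) = \frac{\frac{10}{1} + \frac{20}{2} + \frac{30}{4}}{\frac{1}{1} + \frac{1}{2} + \frac{1}{4}} = \frac{27.5}{1.75} \approx 15.71 $$

The closest observation carries more than half of the total weight, which pulls the estimate towards its value.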

The IDW formula is usually seen as the special case $m=1$ of the generalized form:

$$ Z^* (s_0) = \frac{\sum_{i=1}^N \frac{Z(s_i)}{d^m(s_0, s_i)}} {\sum_{i=1}^N \frac{1}{d^m(s_0, s_i)}} $$

In geoscience, $m = 2$ is also quite common, as the result becomes smoother.
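
To get a feeling for the exponent, the generalized form can be sketched as a small function. This is only an illustration: the helper name `idw_estimate` and the three toy observations are made up, and the special case for a distance of zero is an assumption that enforces $Z^*(s_i) = Z(s_i)$.

```python
import numpy as np


def idw_estimate(coords, values, target, m=1):
    """Hypothetical helper: IDW estimate at `target` using the exponent `m`."""
    coords = np.asarray(coords, dtype=float)
    values = np.asarray(values, dtype=float)
    target = np.asarray(target, dtype=float)

    # Euclidean distance from the target to every observation
    d = np.sqrt(((coords - target) ** 2).sum(axis=1))

    # assumption: if the target coincides with an observation, return that value
    hit = d == 0
    if hit.any():
        return float(values[hit][0])

    # weighted mean with weights 1 / d^m
    w = 1.0 / d ** m
    return float((w * values).sum() / w.sum())


# made-up toy data: three observations at distances 1, 2 and 4 from the target
coords = [(1, 0), (2, 0), (4, 0)]
values = [10.0, 20.0, 30.0]
target = (0, 0)

for m in (1, 2, 4):
    print('m =', m, '-> Z*(s0) =', round(idw_estimate(coords, values, target, m), 3))
```

With a larger $m$ the estimate moves towards the value of the nearest observation (10 in this toy data), because the weights of distant observations decay faster.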

### 6.2 Implementation

We will use the data examples from the last lecture.

In [9]:
coords = pd.read_csv('./data/sample_positions.txt', sep=r'\s+', header=None)
coords.columns = ['x', 'y']
data = pd.read_csv('./data/sample_data.txt', sep=r'\s+')
sample = coords.copy()
sample['z'] = data.loc[0, :].values

In [11]:
sample.head()

|   | x  | y  | z         |
|---|----|----|-----------|
| 0 | 22 | 78 | -0.203508 |
| 1 | 3  | 73 | -0.164411 |
| 2 | 12 | 85 | -0.696673 |
| 3 | 9  | 69 | -0.555673 |
| 4 | 78 | 43 | 1.286489  |


We want an estimate for $s_0 = (44, 56)$:

In [15]:
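s0 = [44, 56]

norm_values = []  # observations divided by their distance to s0
weights = []      # inverse distances

for i in range(len(sample)):
    # Euclidean distance between s0 and sample i
    d = np.sqrt((s0[0] - sample.loc[i, 'x'])**2 + (s0[1] - sample.loc[i, 'y'])**2)
    norm_values.append(sample.loc[i, 'z'] / d)
    weights.append(1. / d)

# weighted mean: sum(z_i / d_i) / sum(1 / d_i)
Z0 = sum(norm_values) / sum(weights)

print('Z*(s0) = ', round(Z0, 2))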

Z*(s0) =  0.08


### 6.3 Example

Now we have 30 samples from the same field and can create a mesh-grid that covers their coordinates. The implementation above can then be applied to each unobserved location to obtain an interpolation of the whole field.

In [34]:
obs = figure(
    title='Observations', width=700, height=700, toolbar_location="above",
    tooltips=[('value', '@z')], tools=['hover']
)

cmap = LinearColorMapper(palette='Cividis256', low=sample.z.min(), high=sample.z.max())
source = ColumnDataSource(sample)

obs.circle('x', 'y', source=source, size=12, line_color=None, fill_color={'field':'z', 'transform':cmap})
obs.add_layout(ColorBar(color_mapper=cmap, location=(0,0)), 'right')

In [35]:
show(obs)

To keep the processing time small, we will use a coarse raster with a cell size of 5 x 5 units, which gives 21 x 21 grid nodes.

In [91]:
size = 5
m = 2
ylim = (0, 100)
xlim = (0, 100)
idw = []

# build a 'mesh'-grid
grid = [[(j,i) for j in range(xlim[0], xlim[1] + size, size)] for i in range(ylim[0], ylim[1] + size, size)]

for row in grid:
    idw_row = []
    for cell in row:
        norm_values = []
        weights = []
        exact = None
        for i in range(len(sample)):
            # calculate distance from cell to sample i
            d = np.sqrt((cell[0] - sample.loc[i, 'x'])**2 + (cell[1] - sample.loc[i, 'y'])**2)
            if d == 0:
                # the cell coincides with a sample: use the observation itself
                # (avoids the division by zero and enforces Z*(s_i) = Z(s_i))
                exact = sample.loc[i, 'z']
                break
            norm_values.append(sample.loc[i, 'z'] / d**m)
            weights.append(1. / d**m)

        # append result: the observation on an exact hit, else the weighted mean
        z = exact if exact is not None else sum(norm_values) / sum(weights)
        idw_row.append(z)

    # append row
    idw.append(idw_row)

In [97]:
result = figure(
    title='IDW interpolation', x_range=(0,100), y_range=(0,100), tools=[]
)

result.image([idw], x=0, y=0, dw=100, dh=100, palette='Cividis256')
result.circle('x', 'y', source=source, size=12, line_color='white', fill_color={'field':'z', 'transform':cmap})

In [98]:
show(result)

The code shown is slow, both from an algorithmic perspective and, especially, from an implementation perspective.
From the algorithmic perspective, ask yourself the question:

*Do we always have to include the whole sample into the weighted mean? If not, why?*

If you installed the lectures locally or are running them in Binder, you can play around a little bit:

* increase and decrease the raster size - what happens to the result?
* increase m - what happens?
* can you speed up the calculation by changing the *algorithm*? (This is not about numpy or `map`, i.e. simply moving to a faster implementation.)
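
One possible algorithmic change, sketched here under assumptions, is to use only the `k` nearest samples for each target instead of the whole sample, since distant samples receive very small weights anyway. The helper `idw_knn`, the choice `k=8` and the random toy data are made up for illustration. The neighbor search uses `scipy.spatial.cKDTree`, and the sketch assumes `2 <= k <= len(xy)`.

```python
import numpy as np
from scipy.spatial import cKDTree


def idw_knn(xy, z, targets, k=8, m=2):
    """Hypothetical helper: IDW from the k nearest samples of each target.

    Assumes 2 <= k <= len(xy), so that the query returns (n_targets, k) arrays.
    """
    tree = cKDTree(xy)
    # distances and sample indices of the k nearest neighbors, nearest first
    d, idx = tree.query(targets, k=k)

    with np.errstate(divide='ignore', invalid='ignore'):
        w = 1.0 / d ** m
        est = (w * z[idx]).sum(axis=1) / w.sum(axis=1)

    # targets that coincide with a sample get the observation itself
    hit = d[:, 0] == 0
    est[hit] = z[idx[hit, 0]]
    return est


# made-up toy data standing in for the lecture sample
rng = np.random.default_rng(0)
xy = rng.uniform(0, 100, size=(30, 2))
z = rng.normal(size=30)

# one unobserved location and one location that coincides with sample 3
targets = np.array([[44.0, 56.0], xy[3]])
est = idw_knn(xy, z, targets, k=8, m=2)
print(est)
```

Whether a fixed `k` or a fixed search radius suits the data better is left open here. Both limit the work per cell to a few neighbors instead of all samples.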

That's always my way to go: first make the algorithm fast by implementing it in a smart way, then increase performance by using `numpy` (in Python) or precompiled structures (vectorized code instead of loops in R).
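
For the second step, the nested loops over cells and samples can be replaced by numpy broadcasting. The following is a minimal sketch, not a drop-in replacement: the random `xy` and `z` arrays are made-up stand-ins for the `sample` DataFrame, and all distances are held in memory at once, which is fine for a 21 x 21 grid and 30 samples.

```python
import numpy as np

# made-up stand-in for the lecture sample: 30 random points with values
rng = np.random.default_rng(42)
xy = rng.uniform(0, 100, size=(30, 2))
z = rng.normal(size=30)

size, m = 5, 2

# grid nodes at 0, 5, ..., 100 in both directions
gx = np.arange(0, 100 + size, size)
gy = np.arange(0, 100 + size, size)
xx, yy = np.meshgrid(gx, gy)                        # both of shape (21, 21)
cells = np.column_stack([xx.ravel(), yy.ravel()])   # shape (441, 2)

# distances between every cell and every sample, shape (441, 30)
d = np.sqrt(((cells[:, None, :] - xy[None, :, :]) ** 2).sum(axis=2))

# weighted mean per cell; divisions by zero are repaired right below
with np.errstate(divide='ignore', invalid='ignore'):
    w = 1.0 / d ** m
    est = (w * z).sum(axis=1) / w.sum(axis=1)

# cells that coincide with a sample get the observation itself
hit_cell, hit_sample = np.nonzero(d == 0)
est[hit_cell] = z[hit_sample]

# back to the 2D raster layout, one row per y value
idw_grid = est.reshape(xx.shape)
print(idw_grid.shape)
```

The result has the same row and column layout as the nested list built by the loops, so it could be passed to the plotting code in the same way.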