*The code was developed on Windows 10 and has not been tested on other platforms. Anaconda 5.2.0 was used as the Python distribution.*

This notebook introduces the concept and methodology of generating variograms.

# Spatial Data Simulation

Let's say that you are a spatial data analyst at a gold mining company and want to know the distribution of gold percentage over a 100 m x 100 m mining area. To understand the characteristics of the rock formations, you take 100 random rock samples from the mining area, but these 100 data points are obviously not enough to estimate the gold percentage at every location in the area. So you analyze the available data (100 rock samples from random locations) and simulate a full 2D surface plot of gold percentage over the mining area.

![Simulation](https://github.com/aegis4048/Petroleum_Engineering/blob/master/Data%20Analysis/img/gold_transform.png?raw=true)

This 2D surface simulation from sparse spatial data is a sequential process that involves many complicated statistical techniques. 

Steps:

1. Plot variogram
2. Fit variogram model
3. Apply kriging
4. Apply simulation on top of kriging
5. Run the simulation multiple times and perform additional data analyses as needed

This notebook covers the concepts, theory, and methodology of plotting a **variogram**.


# Basics of Variograms

> A **variogram** is a measure of dissimilarity as a function of distance. It shows how two data points are correlated from a spatial perspective, and provides useful insight when estimating the value at an unsampled location from sample data collected at other locations.

[Tobler's first law of geography](https://en.wikipedia.org/wiki/Tobler%27s_first_law_of_geography) states that "everything is related to everything else, but near things are more related than distant things." A variogram demonstrates just that: it shows how the correlation between two spatial data points varies with the distance between them. For example, terrains 1 km apart are more likely to be similar than terrains 100 km apart. Oil wells 500 ft apart are more likely to show similar reservoir characteristics than oil wells 5000 ft apart.
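To make "dissimilarity over a distance" concrete, here is a minimal sketch using the commonly used experimental semivariance: half the mean squared difference between values separated by a given lag. The data is a synthetic random walk rather than the notebook's sample file, and the helper name `semivariance_1d` is hypothetical, introduced only for illustration.

```python
import numpy as np

def semivariance_1d(values, lag):
    """Experimental semivariance of a regularly spaced 1D series.

    Hypothetical helper for illustration. `lag` must be an integer >= 1
    and is measured in number of sample spacings.
    """
    values = np.asarray(values, dtype=float)
    diffs = values[lag:] - values[:-lag]   # all pairs separated by `lag`
    return 0.5 * np.mean(diffs ** 2)       # half the mean squared difference

# Assumption: a synthetic random walk stands in for spatially correlated data.
rng = np.random.default_rng(0)
series = np.cumsum(rng.normal(size=500))

gamma_near = semivariance_1d(series, lag=1)    # neighbouring samples
gamma_far = semivariance_1d(series, lag=20)    # samples 20 spacings apart
print(gamma_near, gamma_far)
```

For data of this kind, the semivariance at the short lag should come out smaller than at the long lag, which is the numerical counterpart of "near things are more related than distant things."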


In [184]:
import pandas as pd
import numpy as np
import scipy
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import math
%matplotlib notebook

In [185]:
data = pd.read_excel('sample_data/2D_Data.xlsx', sheet_name="variogram_sample")

In [205]:
x = data['x']
y = data['y']
z = data['z']

def calc_1D_distance(_data):
    num_data = len(_data)
    matrix = []
    for i in range(num_data):
        line = []
        for j in range(num_data):
            var = _data[i] - _data[j]
            line.append(var)
        matrix.append(line)
    return pd.DataFrame(data=matrix)

def calc_2D_distance(_x, _y):
    
    assert (len(_x)==len(_y)), "x and y coordinates must have the same dimension"
    
    num_data = len(_x)
    matrix = []
    for i in range(num_data):
        line = []
        for j in range(num_data):
            var = math.sqrt((_x[i] - _x[j])**2 + (_y[i] - _y[j])**2)
            line.append(var)
        matrix.append(line)
    return pd.DataFrame(matrix)

def calc_azimuth(_x, _y):
    
    assert (len(_x)==len(_y)), "x and y coordinates must have the same dimension"
    
    _dx = calc_1D_distance(_x)
    _dy = calc_1D_distance(_y)
    
    num_data = len(_dx)
    matrix = []
    for i in range(num_data):
        line = []
        for j in range(num_data):        
            if _dx.iloc[i, j] > 0:
                azimuth = np.degrees(np.pi / 2 - np.arctan(_dy.iloc[i, j] / _dx.iloc[i, j]))
            elif _dx.iloc[i, j] < 0:
                azimuth = np.degrees(np.pi * 1.5 - np.arctan(_dy.iloc[i, j] / _dx.iloc[i, j]))
            else:                                                         # _dx = 0
                if _dy.iloc[i, j] > 0:
                    azimuth = 0
                elif _dy.iloc[i, j] < 0:
                    azimuth = 180
                else:
                    azimuth = 0                                       # _dx, _dy = 0
            line.append(azimuth)
        matrix.append(line)
    return pd.DataFrame(matrix)

def calc_pairwise_squared_difference(_z):
    num_data = len(_z)
    matrix = []
    for i in range(num_data):
        line = []
        for j in range(num_data):
            var = (_z[i] - _z[j]) ** 2
            line.append(var)
        matrix.append(line)    
    return pd.DataFrame(matrix)

def is_tolerated(_data, _var, _tol):
    return _data <= _var + _tol and _data > _var - _tol

def calc_num_pairs(_dist_max, _azi, _azi_tol, _lag_dist, _lag_tol, _x, _y):
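    # Use the function arguments rather than the global x and y
    azi_df = calc_azimuth(_x, _y)
    dist_df = calc_2D_distance(_x, _y)
    
    num_data = len(_x)
    # Lag centres are multiples of the lag distance, up to the maximum distance
    lag_list = [i * _lag_dist for i in range(int(_dist_max/_lag_dist) + 1)]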
    
    npair_list = []
    for lag in lag_list:
        count = 0
        for i in range(num_data):
            for j in range(num_data):
                if is_tolerated(dist_df.iloc[i, j], lag, _lag_tol) and is_tolerated(azi_df.iloc[i, j], _azi, _azi_tol):
                    count += 1
        npair_list.append(count)        
    return pd.DataFrame(npair_list)
    
calc_num_pairs(100, 22.5, 22.5, 5, 10, x, y).T


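|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 |
|---|---|---|---|---|---|---|---|---|---|---|-----|----|----|----|----|----|----|----|----|----|----|
| 0 | 4 | 9 | 19 | 25 | 29 | 29 | 28 | 31 | 29 | 29 | ... | 22 | 16 | 21 | 17 | 14 | 14 | 4 | 3 | 0 | 0 |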
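The nested loops above make the logic explicit but scale poorly with the number of samples. As a rough sketch, the same pairwise distances, azimuths, and pair counts can be computed with NumPy broadcasting. The function names `pairwise_distance_and_azimuth` and `count_pairs` are hypothetical, and the sketch assumes the same conventions as the loop version: `dx[i, j] = x[i] - x[j]`, azimuth measured clockwise from north (+y) in degrees, and a half-open tolerance window `(center - tol, center + tol]`.

```python
import numpy as np

def pairwise_distance_and_azimuth(xs, ys):
    """Return (distance, azimuth) matrices for all point pairs.

    Hypothetical helper; azimuth is in degrees, clockwise from north (+y),
    and in the range [0, 360). Coincident points get azimuth 0.
    """
    xs = np.asarray(xs, dtype=float)
    ys = np.asarray(ys, dtype=float)
    dx = xs[:, None] - xs[None, :]      # dx[i, j] = x[i] - x[j]
    dy = ys[:, None] - ys[None, :]      # dy[i, j] = y[i] - y[j]
    dist = np.hypot(dx, dy)
    # arctan2(dx, dy) gives the angle from the +y axis, positive clockwise
    azimuth = np.degrees(np.arctan2(dx, dy)) % 360.0
    return dist, azimuth

def count_pairs(dist, azimuth, lag, lag_tol, azi, azi_tol):
    """Count ordered pairs inside both the lag and the azimuth window."""
    in_lag = (dist > lag - lag_tol) & (dist <= lag + lag_tol)
    in_azi = (azimuth > azi - azi_tol) & (azimuth <= azi + azi_tol)
    return int(np.count_nonzero(in_lag & in_azi))

# Tiny illustrative point set (not the notebook's sample data)
xs_demo = [0.0, 1.0, 0.0]
ys_demo = [0.0, 0.0, 1.0]
dist_demo, azi_demo = pairwise_distance_and_azimuth(xs_demo, ys_demo)
n_east = count_pairs(dist_demo, azi_demo, lag=1.0, lag_tol=0.5, azi=90.0, azi_tol=22.5)
print(n_east)
```

With the notebook's data, calling `count_pairs` once per lag centre on the matrices from `pairwise_distance_and_azimuth(x, y)` should mirror `calc_num_pairs`. Like the loop version, this window does not wrap around 360°, so a direction near north would need extra handling.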
