# Introduction

The aim of this notebook is to estimate the amount of coordinated motion within a well. This is done by calculating the spatial auto-correlation for each average velocity vector field of all the wells in all experiments.

### Coordinated motion & Spatial auto-correlation

- Within our average velocity vector fields, there will be regions where a cluster of neighbouring regions moves in a similar direction (coordinated motion), and there will be non-coordinated regions too.
- For a 2D scalar field, calculating the local spatial auto-correlation produces a 2D mask that shows where the scalar field is clustered, randomly distributed or dispersed. This calculation is carried out using the `pysal/esda` Python library. Specifically, the local Geary's C value is used as the measure of local spatial auto-correlation. See the [original paper](https://onlinelibrary.wiley.com/doi/full/10.1111/gean.12164#:~:text=The%20Local%20Geary%20ci%20is%20a%20univariate%20statistic.,4) for more detail or [this video](https://www.youtube.com/watch?v=EKIKEeAw0W8&ab_channel=GeoDaSoftware) for a quick walk-through. Some detail can also be found in Slides 8 to 15 in `reports/summary_meeting_presentations/09_09_20_summary.pptx`.
- However, the velocity vector field is not a scalar field. It is therefore decomposed into three scalar fields: `speed`, `vx` and `vy`, where `vx` and `vy` measure the motion in the `x` and `y` directions, respectively. As a result, three Geary's c masks are produced for each velocity vector field.
    - Decomposing the velocity vector field into `speed` and `angle` scalar fields was also considered. However, the angle scalar field has an artificial sharp boundary between pixels whose angles are close to 0 degrees and pixels whose angles are close to 360 degrees, even though these pixels move in almost the same direction.
- There is a single tuneable parameter: `NEIGHBORHOOD_SIZE`. This is tuned in [4_sensitivity_analysis_patch_size_vs_coordination_proportion](auxiliary_analysis/4_sensitivity_analysis_patch_size_vs_coordination_proportion.ipynb). When calculating Geary's C, the local neighbourhood of a pixel must be defined. For example, in the figure below, the red neighbourhood has `NEIGHBORHOOD_SIZE=3` and the green neighbourhood has `NEIGHBORHOOD_SIZE=5`.

    <img src="markdown_images_for_notebooks/neighbourhood-size.PNG">
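
As a rough illustration of the quantity being mapped, the sketch below computes a local Geary's c value by hand for every pixel of a small scalar field. It assumes the standardised form `c_i = sum_j w_ij * (z_i - z_j)**2` with binary weights over a square neighbourhood. The function name `local_geary_toy` and the toy field are hypothetical, and this is not the `esda` implementation, which also carries out the permutation-based significance test.

```python
import numpy as np

def local_geary_toy(field, neighborhood_size=3):
    """Hand-rolled local Geary's c with binary weights (illustrative only)."""
    # standardise the field so that values are comparable between fields
    z = (field - field.mean()) / field.std()
    half = neighborhood_size // 2
    n_rows, n_cols = z.shape
    c = np.zeros_like(z, dtype=float)
    for i in range(n_rows):
        for j in range(n_cols):
            # clip the square neighbourhood at the image border
            r0, r1 = max(i - half, 0), min(i + half + 1, n_rows)
            c0, c1 = max(j - half, 0), min(j + half + 1, n_cols)
            patch = z[r0:r1, c0:c1]
            # sum of squared differences to every neighbour
            # (the centre pixel itself contributes zero)
            c[i, j] = np.sum((z[i, j] - patch) ** 2)
    return c

# toy scalar field: uniform on the left half, checkerboard on the right half
field = np.zeros((6, 8))
field[:, :4] = 1.0
rows, cols = np.indices((6, 4))
field[:, 4:] = 5.0 * ((rows + cols) % 2)

c_values = local_geary_toy(field, neighborhood_size=3)
smooth_c = c_values[2, 1]  # pixel surrounded by identical values
rough_c = c_values[2, 6]   # pixel surrounded by alternating values
print(smooth_c, rough_c)
```

The pixel in the uniform region gets a value of zero because it is identical to all of its neighbours, while the pixel in the checkerboard region gets a larger value. Small values therefore correspond to locally similar (clustered) regions.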


# Imports

In [None]:
import numpy as np
import scipy as sp
import os
import json

from matplotlib import pyplot as plt
from tqdm import tqdm

from fam13a import utils, register
from fam13a import spatial_autocorrelation as sac
from multiprocessing import Pool

import time
import pysal as ps
from esda.geary_local import Geary_Local

# Constants

In [None]:
PROJ_ROOT = utils.here()
# declare the data input directory
HBEC_ROOT = os.path.join(PROJ_ROOT, 'data', 'processed', 'hbec')
# print list of experiment IDs
print(os.listdir(HBEC_ROOT))

experiment_list = os.listdir(HBEC_ROOT)

# set level of parallelisation
# NCPUS is the number of files that will be processed in parallel.
# For example, if the node that the notebook is running on has 32 cores,
# then setting NCPUS to 4 means that 4 files are processed in parallel and that 8
# cores are used by each file for processing. The more cores provided to a file,
# the shorter the processing time. Make sure that the number of available cores on
# this node is greater than NCPUS.
NCPUS = 4
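
A minimal sanity check for the constraint described in the comment above, using only the standard library. The name `requested_ncpus` is a hypothetical stand-in for `NCPUS`. Note that `os.cpu_count()` reports the logical cores of the machine, which may be more than a cluster scheduler has actually allocated to the job.

```python
import os

requested_ncpus = 4  # hypothetical stand-in for NCPUS

# os.cpu_count() can return None when the count cannot be determined
available = os.cpu_count() or 1

# never ask for more worker processes than there are visible cores
safe_ncpus = max(1, min(requested_ncpus, available))
print(f"{available} cores visible, using {safe_ncpus} worker processes")
```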

In [None]:
experiment = 'N67030-59_8_perc'

In [None]:
# function to retrieve image size of data given a single experiment
def get_example_image_size(experiment_str):
    # path to the register directory for given experiment
    data_dir_path = os.path.join(HBEC_ROOT, experiment_str, 'register')
    # consider only those entries in "register" that are directories (the other entries could be .dvc files)
    register_dirs = [name for name in os.listdir(data_dir_path) if os.path.isdir(os.path.join(data_dir_path, name))]
    # isolate the path to an example max_frame array
    max_frame_file_path = os.path.join(data_dir_path, register_dirs[0], 'max_frame.npy')
    max_frame = np.load(max_frame_file_path)
    array_shape = max_frame.shape
    
    return array_shape[0]

# Set Up

In [None]:
# when calculating the Geary's c value, significance testing is also carried out by
# permuting the neighborhood of a pixel
PERMUTATIONS = 999

# require the image size of the 2D matrices for which Geary's C will be calculated
IMAGE_SIZE = get_example_image_size(experiment)

# creating a neighbor patch - to define neighbors of any element in an image
NEIGHBORHOOD_SIZE = 9
NEIGHBORHOOD_PATCH = sac.create_neighbor_patch(NEIGHBORHOOD_SIZE)

# create the spatial dictionaries (neighbors and weights)
SPATIAL_DICT = sac.create_spatial_dicts(IMAGE_SIZE, NEIGHBORHOOD_PATCH)

GEARYS_WEIGHT_OBJECT = ps.lib.weights.W(neighbors=SPATIAL_DICT['neighbors'], weights=SPATIAL_DICT['weights'])
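
The internals of `sac.create_neighbor_patch` and `sac.create_spatial_dicts` are not shown in this notebook. The sketch below is one plausible way to build equivalent `neighbors` and `weights` dictionaries for a square image, assuming that pixels are indexed in row-major order and that every neighbour receives a constant weight of 1. The function `build_spatial_dicts_toy` is hypothetical and the real helpers may differ.

```python
import numpy as np

def build_spatial_dicts_toy(image_size, neighborhood_size):
    """Map each flattened pixel index to the pixels in its square neighbourhood."""
    half = neighborhood_size // 2
    # row-major pixel ids laid out as an image
    ids = np.arange(image_size * image_size).reshape(image_size, image_size)
    neighbors, weights = {}, {}
    for i in range(image_size):
        for j in range(image_size):
            # clip the square neighbourhood at the image border
            r0, r1 = max(i - half, 0), min(i + half + 1, image_size)
            c0, c1 = max(j - half, 0), min(j + half + 1, image_size)
            patch_ids = ids[r0:r1, c0:c1].ravel()
            # a pixel is not its own neighbour
            nbrs = [int(k) for k in patch_ids if k != ids[i, j]]
            neighbors[int(ids[i, j])] = nbrs
            # assumed constant weight of 1 for every neighbour
            weights[int(ids[i, j])] = [1.0] * len(nbrs)
    return {'neighbors': neighbors, 'weights': weights}

toy = build_spatial_dicts_toy(image_size=5, neighborhood_size=3)
n_corner = len(toy['neighbors'][0])    # top-left corner pixel
n_centre = len(toy['neighbors'][12])   # centre pixel of the 5x5 image
print(n_corner, n_centre)
```

Border pixels end up with fewer neighbours than interior pixels because the neighbourhood is clipped at the image edge. Dictionaries of this shape, keyed by observation id, are what the `neighbors` and `weights` arguments of the weights object expect.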

In [None]:
def process_geary(mat):
    lg = Geary_Local(
        GEARYS_WEIGHT_OBJECT,
        permutations=PERMUTATIONS,
        n_jobs=1, 
        keep_simulations=False, 
        seed=42
    ).fit(mat)

    return lg

def get_folder_information(exp_id):
    # declare the various output directories
    processed_root = os.path.join(HBEC_ROOT, exp_id)
    register_root = os.path.join(processed_root, 'register')
    output_root = os.path.join(processed_root, 'spatial_auto_correlation', 'neighbor_size_{}_geary'.format(NEIGHBORHOOD_SIZE))

    # find all relevant data files in the data directory 
    file_ids = sorted([_d for _d in os.listdir(register_root) if os.path.isdir(os.path.join(register_root, _d))])

    return processed_root, register_root, output_root, file_ids

def process_files(file_id_):
    # choose data file to load
    file_root = os.path.join(register_root, file_id_)

    # load the downsampled movement mask and the shifts matrix
    shifts = np.load(os.path.join(file_root, 'shifts.npy'))
    sub_movement_mask = np.load(os.path.join(file_root, 'mask.npy'))

    # calculate average velocity field from shifts array
    velocity_fields = register.calculate_mean_velocity_field(shifts)
    norm_shifts = velocity_fields['normalised_velocity']
    mags = velocity_fields['speed']

    # we need the Geary's c values and p-values for speed, vx, vy - total 6 matrix results per
    #  file:
    #  gearys_i_speed, gearys_i_vx, gearys_i_vy,
    #  gearys_p_speed, gearys_p_vx, gearys_p_vy
    # the order here (speed, vx, vy) must match the unpacking of geary_objects below
    geary_list = [mags, norm_shifts[:,:,1], norm_shifts[:,:,0]] 
    geary_objects = []
    for mat in tqdm(geary_list):
        geary_objects.append(process_geary(mat))

    m_speed = geary_objects[0]
    m_vx = geary_objects[1]
    m_vy = geary_objects[2]

    # get matrices from geary objects
    gearys_p_speed = np.reshape(m_speed.p_sim, mags.shape)
    gearys_i_speed = np.reshape(m_speed.localG, mags.shape)
    gearys_p_vx = np.reshape(m_vx.p_sim, mags.shape)
    gearys_i_vx = np.reshape(m_vx.localG, mags.shape)
    gearys_p_vy = np.reshape(m_vy.p_sim, mags.shape)
    gearys_i_vy = np.reshape(m_vy.localG, mags.shape)

    # ensure output directory exists
    sac_dir = os.path.join(output_root, file_id_)
    os.makedirs(sac_dir, exist_ok=True)

    # save the geary object matrices
    np.save(os.path.join(sac_dir, 'p_values_speed.npy'), gearys_p_speed)
    np.save(os.path.join(sac_dir, 'i_values_speed.npy'), gearys_i_speed)
    np.save(os.path.join(sac_dir, 'p_values_vx.npy'), gearys_p_vx)
    np.save(os.path.join(sac_dir, 'i_values_vx.npy'), gearys_i_vx)
    np.save(os.path.join(sac_dir, 'p_values_vy.npy'), gearys_p_vy)
    np.save(os.path.join(sac_dir, 'i_values_vy.npy'), gearys_i_vy)

    # save the spatial auto-correlation parameters in the same folder
    params = {
        'neighborhood_size': NEIGHBORHOOD_SIZE,
        'permutations': PERMUTATIONS,
        'weights': 'constant = 1'
    }
    with open(os.path.join(sac_dir, 'params.json'), 'w') as json_f:
        json.dump(params, json_f)
        
    print(file_id_, 'done', time.asctime())

def save_geary_masks(file_id_):
    sac_file_root = os.path.join(output_root, file_id_)
    
    # load each of the saved spatial autocorrelation matrices (Geary's c values and p-values)
    gearys_p_speed = np.load(os.path.join(sac_file_root, 'p_values_speed.npy'))
    gearys_i_speed = np.load(os.path.join(sac_file_root, 'i_values_speed.npy'))
    gearys_p_vx = np.load(os.path.join(sac_file_root, 'p_values_vx.npy'))
    gearys_i_vx = np.load(os.path.join(sac_file_root, 'i_values_vx.npy'))
    gearys_p_vy = np.load(os.path.join(sac_file_root, 'p_values_vy.npy'))
    gearys_i_vy = np.load(os.path.join(sac_file_root, 'i_values_vy.npy'))
    
    # given the processed geary files, we can now create a geary mask based on i-value and p-value
    # for the p-values, we are only interested in those regions where p<threshold
    # so here we can convert the p-value matrices to boolean p-value masks
    p_value_threshold = 0.05
    gearys_p_speed_mask = gearys_p_speed < p_value_threshold
    gearys_p_vx_mask = gearys_p_vx < p_value_threshold
    gearys_p_vy_mask = gearys_p_vy < p_value_threshold

    # for the Geary's c values (stored as "i-values"), we can look at areas of spatial clusters and spatially dispersed regions
    # a value < 2 (the cut-off used here) indicates that a pixel has neighbouring pixels with similar values; this pixel is part of a cluster.
    # a value > 2 indicates that a pixel has neighbouring pixels with dissimilar values; this pixel is an outlier.
    # In either instance, the p-value for the pixel must be small enough for the cluster or outlier to be considered statistically significant.
    gearys_i_speed_mask = gearys_i_speed < 2
    gearys_i_vx_mask = gearys_i_vx < 2
    gearys_i_vy_mask = gearys_i_vy < 2
    
    # now we can combine a low p-value with a low Geary's c value to get a mask for statistically significant correlated motion
    geary_speed_mask = gearys_i_speed_mask & gearys_p_speed_mask
    geary_vx_mask = gearys_i_vx_mask & gearys_p_vx_mask
    geary_vy_mask = gearys_i_vy_mask & gearys_p_vy_mask
    
    # save the boolean geary masks
    sac_dir = os.path.join(output_root, file_id_)
    np.save(os.path.join(sac_dir, 'speed_mask.npy'), geary_speed_mask)
    np.save(os.path.join(sac_dir, 'vx_mask.npy'), geary_vx_mask)
    np.save(os.path.join(sac_dir, 'vy_mask.npy'), geary_vy_mask)
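
To make the masking logic concrete, the sketch below applies the same two thresholds to small made-up arrays and reports the fraction of pixels flagged as coordinated. The arrays and the name `coordinated_fraction` are purely illustrative, and the values are not results from any experiment.

```python
import numpy as np

# made-up local Geary's c values and pseudo p-values for a 2x3 field
c_values = np.array([[0.5, 3.0, 1.2],
                     [2.5, 0.1, 1.9]])
p_values = np.array([[0.01, 0.01, 0.20],
                     [0.03, 0.04, 0.001]])

p_value_threshold = 0.05
c_threshold = 2

# statistically significant AND similar to its neighbours -> coordinated
coordinated = (c_values < c_threshold) & (p_values < p_value_threshold)

# fraction of pixels flagged as coordinated
coordinated_fraction = coordinated.mean()
print(coordinated)
print(coordinated_fraction)
```

A pixel is kept only when both conditions hold, so a low Geary's c value with a large p-value (top-right pixel) and a small p-value with a high Geary's c value (top-middle pixel) are both excluded.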

# Process

In [None]:
# save spatial autocorrelation geary matrices (i-values and p-values)
processed_root, register_root, output_root, file_ids = get_folder_information(experiment)
print(experiment, '--------------------------')
with Pool(NCPUS) as p:
    # pass the total so that tqdm can show progress as a fraction of all files
    list(tqdm(p.imap(process_files, file_ids), total=len(file_ids)))

In [None]:
# save geary masks
processed_root, register_root, output_root, file_ids = get_folder_information(experiment)
print(experiment, '--------------------------')
with Pool(NCPUS) as p:
    # pass the total so that tqdm can show progress as a fraction of all files
    list(tqdm(p.imap(save_geary_masks, file_ids), total=len(file_ids)))