# Making tools to use covariate xarrays with a Keras model

### Loading covariates and targets

The site measurements and the vegetation-cover quantile raster should already exist as pickle files.

In [6]:
import pickle
with open('site_and_points.pkl','rb') as f:
    final_df = pickle.load(f)

In [19]:
with open("quantile_img.pkl","rb") as f:
    quant_raster = pickle.load(f)

In [8]:
import xarray as xr
tpi = xr.open_rasterio('SOC_geotiff/TPI_ablers.tif')
saga = xr.open_rasterio('SOC_geotiff/sagawetness_albers.tif')

In [7]:
final_df.head()

Unnamed: 0,SampleID,Easting,Northing,TC,Method,Year,points
0,2001_A1.2,338014.132,6370645.57,0.981252,CNS,2001,"Geometry({'type': 'Point', 'coordinates': (178..."
1,A1MIR,338014.132,6370645.57,0.600364,MIR,2001,"Geometry({'type': 'Point', 'coordinates': (178..."
2,2001_A6.2,338068.776,6370868.38,0.866419,CNS,2001,"Geometry({'type': 'Point', 'coordinates': (178..."
3,A6MIR,338068.776,6370868.38,1.187051,MIR,2001,"Geometry({'type': 'Point', 'coordinates': (178..."
4,2001_A11.2,338182.533,6370550.16,0.772519,CNS,2001,"Geometry({'type': 'Point', 'coordinates': (178..."


Now we have a DataFrame with all the position and target information about the site measurements, and raster maps of the TPI and SAGA wetness, along with quantiles of photosynthetic vegetation cover observed by Landsat. We should combine these separate rasters into one huge multi-channel raster, then write a function to select from this raster and produce a 'window' around a site measurement for input into the neural network.

In [17]:
topographic = xr.concat((saga,tpi),dim='band').rename({'band':'channel'})

In [15]:
topographic.shape

(2, 886, 659)

In [20]:
quant_raster = quant_raster.rename({'quantile':'channel'})

In [21]:
covars = xr.concat((topographic,quant_raster),dim='channel')

In [31]:
# remove incompatible metadata in the dimension that was concatenated
import numpy as np
covars['channel'] = np.arange(len(covars['channel']))

In [30]:
covars.channel

<xarray.DataArray 'channel' (channel: 9)>
array([0, 1, 2, 3, 4, 5, 6, 7, 8])
Coordinates:
  * channel  (channel) int64 0 1 2 3 4 5 6 7 8
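
The stacking steps above can be tried on synthetic data. This is only a sketch: the array shapes, coordinate values, and the names `topo`, `quant`, and `stacked` are made up for illustration and are not the real rasters.

```python
import numpy as np
import xarray as xr

# Synthetic stand-ins for the real rasters (illustrative shapes and values only).
y = np.linspace(50.0, 10.0, 5)   # descending, as is typical for north-up rasters
x = np.linspace(0.0, 30.0, 4)

# Two "topographic" bands, renamed so the stacking dimension is called 'channel'.
topo = xr.DataArray(
    np.random.rand(2, 5, 4),
    dims=("band", "y", "x"),
    coords={"band": [1, 2], "y": y, "x": x},
).rename({"band": "channel"})

# Three "quantile" layers on the same grid, renamed the same way.
quant = xr.DataArray(
    np.random.rand(3, 5, 4),
    dims=("quantile", "y", "x"),
    coords={"quantile": [0.1, 0.5, 0.9], "y": y, "x": x},
).rename({"quantile": "channel"})

# Concatenate along 'channel', then replace the mixed band/quantile labels
# with a plain integer index.
stacked = xr.concat((topo, quant), dim="channel")
stacked["channel"] = np.arange(stacked.sizes["channel"])

print(stacked.shape)
```

Both inputs must share the same `x` and `y` coordinates for this to produce a dense array; mismatched grids would be outer-joined and padded with NaNs.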

### Function to create training/validation samples

This function takes a point (x,y) and a specified buffer around it (in pixels), then returns a trimmed raster of the covariates around the point. It should deal with cases where the buffer zone intersects the edge of the covariate raster map.

In [97]:
# Determine the raster resolution by differencing the spatial coordinates.
# If the raster combines covariates from different sources, make sure they are
# co-registered to the same spatial coordinates; otherwise the underlying numpy
# arrays will be filled with NaNs.
xres = float(covars.x[1]-covars.x[0])
yres = float(covars.y[1]-covars.y[0])

def sample_raster(row,bufferx=5,buffery=5):
    # coordinates of the site measurement
    LL = row['points']
    sitex = LL.coords[0][0]
    sitey = LL.coords[0][1]
    
    # build the window coordinates from integer pixel offsets so each axis always
    # has exactly 2*buffer+1 entries (np.arange with a float step can be off by one)
    x = sitex + xres*np.arange(-bufferx,bufferx+1)
    y = sitey + yres*np.arange(-buffery,buffery+1)
    
    # snap to the nearest pixel centre; window pixels beyond the raster edge become NaN
    sample_array = covars.reindex(x=x,method='nearest',tolerance=abs(xres/2)).reindex(y=y,method='nearest',tolerance=abs(yres/2))
    
    return sample_array.data
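
To see how the edge case behaves, here is a minimal sketch on a toy 5x5 raster. The grid, the 10-unit resolution, and the helper `toy_window` are assumptions for illustration; the helper uses the same `reindex(..., method='nearest', tolerance=...)` pattern as `sample_raster`.

```python
import numpy as np
import xarray as xr

# Toy single-channel raster on a regular grid (values are made up).
res = 10.0
xs = np.arange(5) * res            # 0, 10, 20, 30, 40
ys = 40.0 - np.arange(5) * res     # 40, 30, 20, 10, 0 (descending)
toy = xr.DataArray(
    np.arange(25, dtype=float).reshape(1, 5, 5),
    dims=("channel", "y", "x"),
    coords={"channel": [0], "y": ys, "x": xs},
)

def toy_window(da, sitex, sitey, buffer=1):
    """Return a (channel, 2*buffer+1, 2*buffer+1) window around a point."""
    offsets = np.arange(-buffer, buffer + 1)
    wx = sitex + offsets * res
    wy = sitey - offsets * res     # keep y descending like the raster
    return (
        da.reindex(x=wx, method="nearest", tolerance=res / 2)
          .reindex(y=wy, method="nearest", tolerance=res / 2)
          .data
    )

inner = toy_window(toy, sitex=21.0, sitey=19.0)   # fully inside the raster
corner = toy_window(toy, sitex=1.0, sitey=39.0)   # overlaps the top-left corner

print(np.isnan(inner).sum(), np.isnan(corner).sum())
```

The window near the corner keeps its full shape, and the pixels that fall outside the raster are returned as NaN rather than raising an error.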
    
    
        

It may help to standardise the inputs for training. We can do this simply using built-in features of xarray before generating training samples. We can then either impute missing values (NaNs) on-the-fly using Keras or do it using our sample generating function while creating the training/validation set. It is less costly to do the latter because once it's done it will not need to be done again.

In [64]:
covars = (covars - covars.mean(dim=['x','y']))/covars.std(dim=['x','y'])

In [67]:
#save the standardised covariate raster - this will come in handy later on
with open("standardised_NN_covars.pkl","wb") as f:
    pickle.dump(covars,f)
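
One possible way to do the imputation inside the sampling step is sketched below in plain numpy. It assumes mean imputation is acceptable: after standardisation each channel has zero mean, so filling NaNs with 0 is the same as filling with the channel mean. The helper `impute_window` and the synthetic `raster` are hypothetical, not part of the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic (channel, y, x) raster with one missing pixel.
raster = rng.normal(loc=5.0, scale=2.0, size=(3, 6, 6))
raster[0, 0, 0] = np.nan

# Per-channel standardisation that ignores NaNs (numpy equivalent of the
# xarray mean/std over the spatial dimensions).
mean = np.nanmean(raster, axis=(1, 2), keepdims=True)
std = np.nanstd(raster, axis=(1, 2), keepdims=True)
standardised = (raster - mean) / std

def impute_window(window, fill_value=0.0):
    """Replace NaNs in a sampled window with a constant (hypothetical helper)."""
    return np.where(np.isnan(window), fill_value, window)

filled = impute_window(standardised)
print(np.isnan(filled).any())
```

In the notebook the same fill could be applied to the array returned by `sample_raster` before it is handed to the network.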

## Generating labelled datasets for training and validation
We can now iterate through the dataframe and save the input data associated with each sample site to a directory, so that a Keras generator can read it back one batch at a time. This avoids loading every sample into RAM to train the NN. The labels are the measured SOC values in the dataframe (the `TC` column). Each row of the dataframe needs to be associated with a unique file on disk that the generator can read when it feeds samples to Keras.
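
A minimal sketch of the row-to-file association, assuming one `.npy` file per row named after the dataframe index. The `sample_path` column, the temporary output directory, and `fake_sample` (a stand-in for `sample_raster`) are all hypothetical.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Toy dataframe standing in for final_df.
toy_df = pd.DataFrame({"SampleID": ["s0", "s1", "s2"], "TC": [0.98, 0.60, 0.87]})

def fake_sample(row):
    # Stand-in for sample_raster: returns a (channel, y, x) window.
    return np.full((9, 11, 11), row["TC"], dtype="float32")

# Hypothetical output directory; a real run would use a persistent path.
out_dir = tempfile.mkdtemp(prefix="covar_samples_")

def save_sample(row):
    # row.name is the dataframe index label, which is unique per row here.
    path = os.path.join(out_dir, f"{row.name}.npy")
    np.save(path, fake_sample(row))
    return path

# Record where each row's input window was written.
toy_df["sample_path"] = toy_df.apply(save_sample, axis=1)

# A generator can later load a single sample without touching the others.
reloaded = np.load(toy_df.loc[1, "sample_path"])
print(reloaded.shape)
```

With this layout a generator only needs the dataframe (paths and labels) in memory and reads each window from disk when a batch is requested.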

In [None]:
from tensorflow.keras.utils import Sequence

In [73]:
len(final_df)

2183

In [80]:
final_df.iloc[0:10]['TC']

0    0.981252
1    0.600364
2    0.866419
3    1.187051
4    0.772519
5    1.398617
6    0.593211
7    1.126836
8    1.315066
9    1.881963
Name: TC, dtype: float64

In [112]:
class CovarGenerator(Sequence):
    """
    Feed batches of trimmed covariate images and their TC labels to an NN
    """
    
    def __init__(self,gen_df,batch_size = 32, shuffle = True):
        self.batch_size = batch_size
        
        # number of full batches per epoch; a final partial batch is dropped
        self.length = len(gen_df)//batch_size
        
        self.shuffle = shuffle
        
        if self.shuffle:
            self.gen_df = gen_df.sample(frac=1).reset_index(drop=True)
        else:
            self.gen_df = gen_df
        
    def __getitem__(self,index):
        # rows belonging to batch number `index`
        slicedf = self.gen_df.iloc[index*self.batch_size:(index+1)*self.batch_size]
        y = np.array(slicedf['TC'])
        # one covariate window per row, stacked to (batch, channel, y, x)
        X = np.stack(slicedf.apply(sample_raster,axis=1))
        return (X,y)
        
    def __len__(self):
        return self.length
    
    def on_epoch_end(self):
        # reshuffle the rows between epochs
        if self.shuffle:
            self.gen_df = self.gen_df.sample(frac=1).reset_index(drop=True)



In [104]:
testgen = CovarGenerator(final_df)

In [105]:
X,y = testgen[0]

In [106]:
X.shape

(32, 9, 11, 11)

In [107]:
len(testgen)

68

In [111]:
y.shape

(32,)
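
The batches come out channels-first, `(batch, channel, y, x)`. Keras convolutional layers expect channels-last input unless `data_format` is configured otherwise, so the axes may need reordering before training. A small numpy sketch, using a synthetic batch with the same shape as `X` above:

```python
import numpy as np

# Synthetic batch with the generator's layout: (batch, channel, y, x).
X_batch = np.zeros((32, 9, 11, 11), dtype="float32")
X_batch[:, 3, 5, 5] = 1.0   # mark the centre pixel of channel 3

# Move the channel axis to the end: (batch, y, x, channel).
X_last = np.transpose(X_batch, (0, 2, 3, 1))

print(X_last.shape)
```

Whether to transpose in the generator or to set `data_format='channels_first'` on the layers is a design choice; the transpose is shown here only because it needs no change to the model.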