<center><h2>Feature Space Generation</h2></center>

<img src='https://www.projectrhea.org/rhea/images/thumb/2/22/Hyperplane.png/700px-Hyperplane.png'/>

This notebook shows how to use the ADC to create a feature space for each parcel from parameters such as time range, bands and buffer zone. A daily interpolation of the data is also included.

### Import Libraries 

In [1]:
import os
import sys
import logging
from datetime import datetime, timedelta
from pathlib import Path

import h5py
import numpy as np
import pandas as pd
import xarray as xr
from tqdm import tqdm

# Make the parent directory importable so that adc_utils can be found
sys.path.insert(0,str(Path('.').absolute().parent))
from adc_utils import * # Please change the path for the ADC configuration (if you use the Cube)

### Load Sentinel Data
Open Data Cube (on which the ADC is based) can be installed by following the instructions in the <a href="https://github.com/Agri-Hub/datacap">DataCAP</a> repository. Nevertheless, we have uploaded <b>netCDF</b> files extracted from the ADC, so users can easily load the required data and run these notebooks as a proof of concept. The data are available at **https://zenodo.org/record/6458275#.YlcRz1vP2E1**


In [2]:
import os
ws_datasets = '../datasets/Cyprus'
netcdf = []
# Open every optical netCDF file in the folder and merge them into a single Dataset
for filename in sorted(os.listdir(ws_datasets)):
    if 'optical' not in filename:
        continue
    netcdf.append(xr.open_dataset(os.path.join(ws_datasets,filename)))
data_nc = xr.merge(netcdf)

### Overview of Optical Data loaded from netCDF

In [3]:
data_nc

### Loading Sentinel Data (if ADC is installed and configured): Setting the required parameters
The following parameters are used to query the ADC. As the data have already been exported to netCDF, they are not needed when working with the provided files.

<code>
year = '2019'
pilot = 'cyprus'
region = 'new_area_1'
buffer = 'buffer5'
shapefile_bbox = 'full_path_of_desired_shapefile'
driver = ogr.GetDriverByName("ESRI Shapefile")  # needs `from osgeo import ogr`, unless adc_utils already exposes it
ds = driver.Open(shapefile_bbox,0)
layer = ds.GetLayer()
xmin,xmax,ymin,ymax = layer.GetExtent()
bbox = [xmin,xmax,ymin,ymax]
product_ids = 'ids_{}_{}'.format(pilot,year)
ids = getIDs(product_ids,xmin,xmax,ymin,ymax)
</code>

###  Set the parameters 

Apart from the parameters above, both the netCDF files and a direct ADC query require the following parameters:

In [4]:
year = '2019'

pilot = 'cyprus'

buffer = 'buffer5'


optical_bands = ['B02','B03','B04','B05','B06','B07','B08','B8A','B11','B12','ndvi','ndwi','psri']

sar_bands = ['vv','vh']

# two months, as a proof of concept. Change it!
start_date,end_date = '2019-01-01','2019-03-01' 

# number of sub-intervals the date range is split into. Data are loaded one sub-interval at a time to reduce RAM usage
d_break = 8 

### Load IDs of declarations and corresponding crop types

Each declared parcel in LPIS comes with a unique id and its crop type. We have converted the unique ids to new ids named noa_id.

In [5]:
dbf = pd.read_csv(os.path.join(ws_datasets,'cyprus_ids_cropTypes_2019.csv'))
dbf.head()

| Unnamed: 0.1 | Unnamed: 0 | Decl_Code | noa_id | Description |
|---|---|---|---|---|
| 0 | 0 | 1 | 1 | Durum Wheat |
| 1 | 1 | 1 | 2 | Durum Wheat |
| 2 | 2 | 1 | 3 | Durum Wheat |
| 3 | 3 | 1 | 4 | Durum Wheat |
| 4 | 4 | 3 | 5 | Barley |


In [6]:
noa_to_id = {k:v for k,v in dbf[['noa_id','Decl_Code']].values}

### Load the ids for each parcel based on a layer that has been indexed on the cube

**The LPIS shapefile containing the farmers' declarations was rasterized and indexed into the ADC before the process starts. This enables fast processing through xarray operations.**

In [7]:
ids = xr.open_dataset(os.path.join(ws_datasets,'ids_2019.nc'))

### An overview of the xarray of ids
The data variables contain the same LPIS layer rasterized with different buffer zones. Using different buffers lets us exclude mixels (mixed pixels along parcel borders) from the analysis.
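How the buffered id layers were produced is not shown in this notebook. As a rough illustration of why a buffer removes mixels, the sketch below keeps a pixel's id only when its four neighbours carry the same id. The function name `erode_ids` and the one-pixel erosion are assumptions made for this example, not the ADC procedure.

```python
import numpy as np

def erode_ids(id_raster, nodata=-1):
    """Keep a pixel's id only if its 4 neighbours carry the same id."""
    # Pad with nodata so that pixels on the raster edge count as border pixels
    padded = np.pad(id_raster, 1, mode="constant", constant_values=nodata)
    centre = padded[1:-1, 1:-1]
    same = (
        (padded[:-2, 1:-1] == centre)    # neighbour above
        & (padded[2:, 1:-1] == centre)   # neighbour below
        & (padded[1:-1, :-2] == centre)  # neighbour to the left
        & (padded[1:-1, 2:] == centre)   # neighbour to the right
    )
    return np.where(same, centre, nodata)

# Two adjacent toy parcels (ids 1 and 2)
ids_raw = np.array([
    [1, 1, 1, 2, 2, 2],
    [1, 1, 1, 2, 2, 2],
    [1, 1, 1, 2, 2, 2],
])
ids_buffered = erode_ids(ids_raw)
print(ids_buffered)
```

Only the interior pixel of each toy parcel survives, so pixels that straddle the boundary between two crops no longer contribute to the parcel statistics.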

In [8]:
ids

### Filter Data for a specific time period

In [9]:
start = pd.to_datetime(start_date).to_numpy()
end = pd.to_datetime(end_date).to_numpy()
data_nc = data_nc.where((data_nc.time>=start) & (data_nc.time<end),drop=True)
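The filter above uses a half-open interval: timestamps at or after `start` and strictly before `end`. The sketch below, on made-up acquisition times, shows why this form is usually safer than an inclusive comparison against a date that resolves to midnight. All values here are illustrative.

```python
import numpy as np

# Three hypothetical acquisition timestamps
times = np.array(
    ["2019-01-06T08:30", "2019-01-07T08:30", "2019-01-08T08:30"],
    dtype="datetime64[m]",
)
start = np.datetime64("2019-01-01")
end_day = np.datetime64("2019-01-07")  # last calendar day we want to keep

# Inclusive comparison against midnight misses the 08:30 scene on the end day
inclusive_midnight = (times >= start) & (times <= end_day)

# A half-open interval up to the following midnight keeps it
half_open = (times >= start) & (times < end_day + np.timedelta64(1, "D"))

print(inclusive_midnight, half_open)
```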

### Filtered Xarray

In [10]:
data_nc

### Get the dates related to the data and data intervals

In [11]:
from datetime import datetime, timedelta
# date_range is expected to come from adc_utils and to yield 'YYYY-MM-DD' strings
dates_l = list(date_range(start_date,end_date,d_break))
dates_list = []
for i in range(len(dates_l)-1):
    start,end = dates_l[i],dates_l[i+1]
    start = datetime.strptime(start,"%Y-%m-%d")
    end = datetime.strptime(end,"%Y-%m-%d")
    end = end - timedelta(1)
    dates_list.append((start.strftime("%Y-%m-%d"),end.strftime("%Y-%m-%d")))
dates_list

[('2019-01-01', '2019-01-07'),
 ('2019-01-08', '2019-01-14'),
 ('2019-01-15', '2019-01-22'),
 ('2019-01-23', '2019-01-29'),
 ('2019-01-30', '2019-02-05'),
 ('2019-02-06', '2019-02-13'),
 ('2019-02-14', '2019-02-20'),
 ('2019-02-21', '2019-02-28')]
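The `date_range` helper used above comes from `adc_utils` and is not shown here. The following standalone sketch is a hypothetical reimplementation (`split_interval` is a name chosen for this example) that splits the period into equal steps, truncates each break to a calendar day, and ends each window one day before the next one starts.

```python
from datetime import datetime, timedelta

def split_interval(start, end, n_breaks):
    """Split [start, end) into n_breaks consecutive, non-overlapping windows."""
    t0 = datetime.strptime(start, "%Y-%m-%d")
    t1 = datetime.strptime(end, "%Y-%m-%d")
    step = (t1 - t0) / n_breaks
    # Window edges: equally spaced in time, truncated to the calendar day
    edges = [(t0 + step * i).date() for i in range(n_breaks)] + [t1.date()]
    return [
        (first.isoformat(), (nxt - timedelta(days=1)).isoformat())
        for first, nxt in zip(edges[:-1], edges[1:])
    ]

windows = split_interval("2019-01-01", "2019-03-01", 8)
for w in windows:
    print(w)
```

With the notebook's settings (`'2019-01-01'`, `'2019-03-01'`, `d_break = 8`) this produces the same eight windows as listed above. Because the step is 7.375 days, some windows span seven days and others eight.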

### Create an HDF5 file to write to

In [12]:
ws_output = '/home/eouser/Desktop' 
hf = h5py.File(os.path.join(ws_output,'fs_{0}_{1}_{2}_raw5.h5'.format(pilot,year,buffer)), 'a')
data = hf.create_group('data')
sar_data = hf.create_group('sar_data')
coords = hf.create_group('coords')
meta = hf.create_group('metadata')
meta.create_dataset('bands',(len(optical_bands)),'S10',[b.encode('ascii','ignore') for b in optical_bands])
meta.create_dataset('sar_bands',(len(sar_bands)),'S10',[b.encode('ascii','ignore') for b in sar_bands])

<HDF5 dataset "sar_bands": shape (2,), type "|S10">

### Iterate over time and get pixel data for each parcel

The iteration over the time periods defined above covers both data retrieval and the generation of zonal statistics. Instead of iterating over every parcel, we exploit the xarray of ids. Specifically, the command <code> grouped_data = data_cube.groupby(ids[buffer][0]) </code> groups the pixels by parcel id, which reduces the running time and allows aggregation functions to be applied to each group.
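To make the grouping idea concrete without the cube, the sketch below computes a per-parcel mean from a value raster and an id raster using plain NumPy. It is a simplified stand-in for the xarray `groupby` call, with toy arrays and a hypothetical helper name (`zonal_mean`); `-1` marks pixels outside any parcel, as in the loop below.

```python
import numpy as np

def zonal_mean(values, id_raster, nodata=-1):
    """Mean of `values` per parcel id, ignoring pixels labelled `nodata`."""
    ids_flat = id_raster.ravel()
    vals_flat = values.ravel()
    keep = ids_flat != nodata
    # `inverse` maps every kept pixel to the position of its id in `unique_ids`
    unique_ids, inverse = np.unique(ids_flat[keep], return_inverse=True)
    sums = np.bincount(inverse, weights=vals_flat[keep])
    counts = np.bincount(inverse)
    return dict(zip(unique_ids.tolist(), (sums / counts).tolist()))

id_raster = np.array([[1, 1, -1],
                      [2, 2,  2]])
ndvi = np.array([[0.2, 0.4, 0.9],
                 [0.6, 0.6, 0.9]])
parcel_means = zonal_mean(ndvi, id_raster)
print(parcel_means)
```

Every pixel is visited once regardless of the number of parcels, which is the advantage over clipping the raster parcel by parcel.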

In [None]:
flag = True

for d_n in tqdm(range(len(dates_list))):  
    try:
        d_start,d_end = dates_list[d_n][0],dates_list[d_n][1]
        
        # ---------------------------------------------------#
        # ----Start of two approaches for loading data ------
        # ---------------------------------------------------#
        
        ################## By using ADC ###########################
        
        # data_cube = getData_optical(bbox,d_start,d_end,optical_bands)
        # data_cube = data_cube.load()

         ########### By using datasets provided in git ###########

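        d_start = pd.to_datetime(d_start).to_numpy()
        # d_end is a calendar day; move the bound to the next midnight so acquisitions taken during that day are kept
        d_end = (pd.to_datetime(d_end) + pd.Timedelta(days=1)).to_numpy()
        data_cube = data_nc.where((data_nc.time>=d_start)&(data_nc.time<d_end),drop=True)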
        
        # No need to change the two following lines:
        all_bands = ['B02','B03','B04','B05','B06','B07','B08','B8A','B11','B12','SCL']
        all_indices = ['ndvi','ndwi','ndmi','psri','savi','evi','dvi','rdvi','rvi','tvi','tcari','gi','vigreen','varigreen','gari','gdvi','sipi','wdrvi','gvmi','gcvi']
        
        bands = [b for b in optical_bands if b in all_bands]
        indices = [b for b in optical_bands if b in all_indices]
  
        # calculating new indices and drop cloudy images based on a user-defined threshold
        
        for i,index in enumerate(optical_bands):
            if index in indices:
                data_cube[index] = calculate_index(data_cube,index)
            data_cube[index] = cloud_data(data_cube,index)
            if i == 0:
                to_keep = data_cube[index].dropna(dim='time',thresh=0.25).time
                data_cube = data_cube.sel(time=to_keep)
        for b in all_bands:
            if b not in bands:
                data_cube = data_cube.drop(b)        
        # ---------------------------------------------------#
        # ------ End of two approaches for loading data ------
        # ---------------------------------------------------#
        ### important step: grouping of pixels per parcel based on id raster
        grouped_data = data_cube.groupby(ids[buffer][0])
    except Exception as e:
        print(e)
        continue
    
    dates = data_cube.time.values
    dates = np.array([str(t).split('T')[0] for t in dates])
    dates_unique = sorted(set(dates))  
    
    all_keys = []
    for f in grouped_data:
        key, parcel_data = f[0],f[1]
        if key!=-1:
            coords_all = list(zip(parcel_data.x.values,parcel_data.y.values))
            parcel_data = np.array([parcel_data[b].values for b in optical_bands])

            if len(dates)!=len(dates_unique):  # several acquisitions share a date: average them
                vals = []
                for b in range(parcel_data.shape[0]):
                    df = pd.DataFrame(data=parcel_data[b],index=dates)
                    vals.append(df.groupby(df.index).mean().values)
                vals = np.array(vals)
            else:
                vals = parcel_data.copy()
            if flag:
                coords.create_dataset(str(key),data=np.array(coords_all), compression='gzip',compression_opts=9)
                data.create_dataset(str(key),data=vals.astype('float64'),compression='gzip',
                                    compression_opts=9,maxshape=(vals.shape[0],None,vals.shape[2]))
            else:
                data[str(key)].resize((data[str(key)].shape[1] + vals.shape[1]), axis=1)
                data[str(key)][:,-vals.shape[1]:,:] = vals
            all_keys.append(key)

    if flag:
        meta.create_dataset('unique_ids',(len(all_keys)),'S40',
                            [noa_to_id[i] for i in np.array(sorted(all_keys))])
        meta.create_dataset('dates',(len(dates_unique)),'S10',
                            [d.encode('ascii','ignore') for d in dates_unique],maxshape=(None,))
        flag = False
    else:
        meta['dates'].resize((meta['dates'].shape[0] + len(dates_unique)), axis=0)
        meta['dates'][-len(dates_unique):] = dates_unique
    

 12%|█████▌                                      | 1/8 [04:10<29:12, 250.42s/it]
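Inside the loop, acquisitions that fall on the same calendar date are averaged before being written. The sketch below isolates that step with NumPy on a toy `(time, pixels)` array. `mean_per_date` is a name chosen for this example, and unlike the pandas `mean()` used in the loop it does not skip NaN values.

```python
import numpy as np

def mean_per_date(values, dates):
    """Average the rows of `values` (time, pixels) that share a date label."""
    values = np.asarray(values, dtype=float)
    # ISO-formatted date strings sort chronologically, so unique_dates is ordered in time
    unique_dates, inverse = np.unique(np.asarray(dates), return_inverse=True)
    sums = np.zeros((len(unique_dates), values.shape[1]))
    np.add.at(sums, inverse, values)  # accumulate rows that map to the same date
    counts = np.bincount(inverse, minlength=len(unique_dates))
    return unique_dates, sums / counts[:, None]

dates = ["2019-01-03", "2019-01-03", "2019-01-08"]
values = [[1.0, 3.0],
          [3.0, 5.0],
          [10.0, 20.0]]
unique_dates, means = mean_per_date(values, dates)
print(unique_dates, means)
```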

### The same for SAR data

In [None]:
# def getData_sar(bbox,timeStart,timeEnd,sar_bands,resolution = 10):
    
#     product_sar= 'sentinel1_sar'
#     all_sar_bands = ['vv','vh']
    
#     bands = [b for b in sar_bands if b in all_sar_bands]
#     if bbox is not None:
#         xmin,xmax,ymin,ymax = bbox[0],bbox[1],bbox[2],bbox[3]
#     query = {
#         'time': (timeStart,timeEnd),
#         'product': product_sar,
#         'x':(xmin,xmax),
#         'y':(ymin,ymax),
#         'crs':'EPSG:3857'
#     }
#     dc = datacube.Datacube(app="test", config=config)
#     data = dc.load(**query,measurements=all_sar_bands,dask_chunks={})
#     for b in all_sar_bands:
#         if b not in bands:
#             data = data.drop(b)
#     return data

# flag = True

# if sar_bands:
    
#     for d_n in tqdm(range(len(dates_list))):
        
#         try:        
#             d_start,d_end = dates_list[d_n][0],dates_list[d_n][1]
#             data_cube = getData_sar(bbox,d_start,d_end,sar_bands)
#             data_cube = data_cube.load()
#             grouped_data = data_cube.groupby(ids[buffer][0])
#         except:
#             continue
#         sar_dates = data_cube.time.values
#         sar_dates = np.array([str(t).split('T')[0] for t in sar_dates])
#         dates_unique = sorted(set(sar_dates))  


#         for f in grouped_data:
#             key, parcel_data = f[0],f[1]
# #             if key in noa_ids_to_keep:
#             if key!=-1:
#                 parcel_data = np.array([parcel_data[b].values for b in sar_bands])

#                 if len(sar_dates)!=len(dates_unique):
#                     vals = []
#                     for b in range(parcel_data.shape[0]):
#                         df = pd.DataFrame(data=parcel_data[b],index=sar_dates)
#                         vals.append(df.groupby(df.index).mean().values)
#                     vals = np.array(vals)
#                 else:
#                     vals = parcel_data.copy()
                    
#                 if flag:
#                     sar_data.create_dataset(str(key),data=vals.astype('float64'),compression='gzip',
#                                         compression_opts=9,maxshape=(vals.shape[0],None,vals.shape[2]))
#                 else:
#                     sar_data[str(key)].resize((sar_data[str(key)].shape[1] + vals.shape[1]), axis=1)
#                     sar_data[str(key)][:,-vals.shape[1]:,:] = vals


#         dates_unique = sorted(set(sar_dates))   
#         if flag:
#             meta.create_dataset('sar_dates',(len(dates_unique)),'S10',
#                                 [d.encode('ascii','ignore') for d in dates_unique],maxshape=(None,))
#             flag = False
#         else:
#             meta['sar_dates'].resize((meta['sar_dates'].shape[0] + len(dates_unique)), axis=0)
#             meta['sar_dates'][-len(dates_unique):] = dates_unique

In [None]:
hf.close()

## Filtering and Daily Linear Interpolation

In [None]:
# ### interpolation timestamps

# def date_range(start, end, intv):

#     start = datetime.strptime(start,"%Y-%m-%d")#+timedelta(days=31)
#     end = datetime.strptime(end,"%Y-%m-%d")-timedelta(days=1)
#     diff = (end  - start ) / intv
#     for i in range(intv):
#         yield (start + diff * i)
#     yield end

# s2_temporal_resolution = 5
# s1_temporal_resolution = 6

# start = datetime.strptime(start_date,"%Y-%m-%d")#+timedelta(days=31)
# end = datetime.strptime(end_date,"%Y-%m-%d")-timedelta(days=1)
# d_break = (end-start).days//s2_temporal_resolution
# d_break_sar = (end-start).days//s1_temporal_resolution
    
# dates_interp = list(date_range(start_date,end_date,d_break))
# dates_interp = [d.date() for d in dates_interp]
# dates_interp_sar = list(date_range(start_date,end_date,d_break_sar))
# dates_interp_sar = [d.date() for d in dates_interp_sar]

In [None]:
# def daily_interpolation(vals,dates_timestamp,sar=False,smoothing=False):

#     df_band = pd.DataFrame(data=vals.T.copy(),columns=dates_timestamp)
#     #     df_band[df_band==-9999] = np.nan
#     start = datetime.strptime(start_date,"%Y-%m-%d")-timedelta(days=1)
#     end = datetime.strptime(end_date,"%Y-%m-%d")+timedelta(days=1)
#     df_band[start] = np.nan
#     df_band[end] = np.nan
#     df_band = df_band[sorted(df_band.columns)]
#     df_band = df_band.resample('D',axis=1).mean().interpolate('linear',axis=1).bfill(axis=1).ffill(axis=1)
#     if sar:
#         df_band = df_band[dates_interp_sar]
#     else:
#         df_band = df_band[dates_interp]
#     if smoothing:
#         df_band = df_band.rolling(window=3,center=True,axis=1).median().bfill(axis=1).ffill(axis=1)
    
#     return df_band.values.T
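The function above resamples to daily steps, interpolates linearly, and fills the edges by backward and forward filling. A reduced sketch of the same idea for a single pixel, using `np.interp` instead of pandas, might look as follows. The helper name and the encoding of time as integer days since the start date are assumptions of this example.

```python
import numpy as np

def interpolate_daily(values, obs_days, target_days):
    """Linearly interpolate one pixel's time series onto `target_days`.

    Assumes at least one observation is not NaN.
    """
    values = np.asarray(values, dtype=float)
    obs_days = np.asarray(obs_days, dtype=float)
    valid = ~np.isnan(values)  # drop masked (e.g. cloudy) acquisitions
    # np.interp holds the first/last valid value constant outside the observed range
    return np.interp(target_days, obs_days[valid], values[valid])

obs_days = [0, 5, 10, 15]        # days since the start date
ndvi = [0.2, np.nan, 0.6, 0.5]   # NaN marks a cloudy acquisition
daily = interpolate_daily(ndvi, obs_days, np.arange(0, 18))
print(daily)
```

The gap at day 5 is bridged by the straight line between days 0 and 10, and days 16 and 17 repeat the last valid value, which mirrors the forward fill in the pandas version.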

In [None]:
# hf = h5py.File('/home/eouser/Desktop/jason_notebooks/fs/{0}/fs_{0}_{1}_{2}_{3}_raw.h5'.format(pilot,region,year,buffer), "r", libver='latest', swmr=True)
# unique_ids = np.array([str(i.decode('UTF-8')) for i in hf['metadata']['unique_ids'][:]])
# dates = np.array([str(d.decode('UTF-8')) for d in hf['metadata']['dates'][:]])
# dates_timestamp = np.array([datetime.strptime(x,"%Y-%m-%d") for x in dates])
# sar_dates = np.array([str(d.decode('UTF-8')) for d in hf['metadata']['sar_dates'][:]])
# sar_dates_timestamp = np.array([datetime.strptime(x,"%Y-%m-%d") for x in sar_dates])
# bands = np.array([str(b.decode('UTF-8')) for b in hf['metadata']['bands'][:]])
# sar_bands = np.array([str(b.decode('UTF-8')) for b in hf['metadata']['sar_bands'][:]])
# ids = np.array(list(hf['data'].keys()))

# # ids = sorted(set(ids.astype(int)).intersection(set(noa_ids_to_keep)))
# ids = np.array(ids).astype(str)
# unique_ids = np.array([noa_to_id[int(i)] for i in ids])

# ndvi_i = np.where(bands=='ndvi')[0][0]
# lower_ndvi_thresh = 0 
# sample = True # set to True to extract a random sample of pixels inside each parcel
# # sample_size = 0.2 # a value > 1 is the number of sampled pixels; a value <= 1 is the fraction of pixels
# sample_size = 10


# dates_new = [d.strftime("%Y-%m-%d") for d in dates_interp]
# sar_dates_new = [d.strftime("%Y-%m-%d") for d in dates_interp_sar]


# hf_interp = h5py.File('/home/eouser/Desktop/jason_notebooks/fs/{0}/fs_{0}_{1}_{2}_{3}_interp.h5'.format(pilot,region,year,buffer), 'w')
# data_interp = hf_interp.create_group('data')
# sar_data_interp = hf_interp.create_group('sar_data')
# coords_interp = hf_interp.create_group('coords')
# meta_interp = hf_interp.create_group('metadata')
# meta_interp.create_dataset('bands',(len(bands)),'S10',[b.encode('ascii','ignore') for b in bands])
# meta_interp.create_dataset('sar_bands',(len(sar_bands)),'S10',[b.encode('ascii','ignore') for b in sar_bands])
# meta_interp.create_dataset('dates',(len(dates_new)),'S10',[d.encode('ascii','ignore') for d in dates_new])
# meta_interp.create_dataset('sar_dates',(len(sar_dates_new)),'S10',[d.encode('ascii','ignore') for d in sar_dates_new])
# meta_interp.create_dataset('unique_ids',(len(unique_ids)),'S40',[i.encode('ascii','ignore') for i in unique_ids])


# for i in tqdm(ids):
    
#     vals = hf['data'][i][:]
#     sar_vals = hf['sar_data'][i][:]
#     coords = hf['coords'][i][:]
    
#     if sample:
#         vals_size = vals.shape[-1]
#         if sample_size <= 1:
#             n_samples = int(vals_size*sample_size)  # keep sample_size itself unchanged for the next parcel
#             s = np.sort(np.random.choice(np.arange(vals_size),n_samples,replace=False))
#             vals = vals[:,:,s]
#             sar_vals = sar_vals[:,:,s]
#             coords = coords[s,:]
#         else:
#             vals_size = vals.shape[-1]
#             if sample_size<=vals_size:
#                 s = np.sort(np.random.choice(np.arange(vals_size),sample_size,replace=False))
#             else:
#                 s = np.sort(np.random.choice(np.arange(vals_size),sample_size,replace=True))
#             vals = vals[:,:,s]
#             sar_vals = sar_vals[:,:,s]
#             coords = coords[s,:]
    
    
#     vals[vals==-9999.] = np.nan
#     sar_vals[(sar_vals==0.)|(sar_vals<=-30.)] = np.nan
    
#     ii = np.where(vals[ndvi_i,:,:]<=lower_ndvi_thresh) ## put nan for every band if ndvi<=ndvi_threshold
    
#     vals_new = np.zeros((vals.shape[0],len(dates_new),vals.shape[-1]))
#     sar_vals_new = np.zeros((sar_vals.shape[0],len(sar_dates_new),sar_vals.shape[-1]))
#     for b in range(len(bands)):
#         vals[b,ii[0],ii[1]] = np.nan
#         vals_new[b,:,:] = daily_interpolation(vals[b,:,:],dates_timestamp,sar=False)
#     for b in range(len(sar_bands)):
#         sar_vals_new[b,:,:] = daily_interpolation(sar_vals[b,:,:],sar_dates_timestamp,sar=True)

#     coords_interp.create_dataset(str(i),data=np.array(coords), compression='gzip',compression_opts=9)
#     data_interp.create_dataset(str(i),data=vals_new.astype('float16'),compression='gzip',compression_opts=9)
#     sar_data_interp.create_dataset(str(i),data=sar_vals_new.astype('float16'),compression='gzip',compression_opts=9)
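Before interpolating, the loop above sets every optical band to NaN wherever NDVI is at or below `lower_ndvi_thresh`. The sketch below shows that masking step on a toy `(bands, time, pixels)` array. `mask_low_ndvi` is a hypothetical helper written for this example.

```python
import numpy as np

def mask_low_ndvi(vals, ndvi_index, threshold=0.0):
    """Set all bands to NaN where NDVI <= threshold. vals: (bands, time, pixels)."""
    out = vals.astype(float, copy=True)
    low = out[ndvi_index] <= threshold  # boolean mask of shape (time, pixels)
    out[:, low] = np.nan                # apply the same mask to every band
    return out

vals = np.array([
    [[100.0, 200.0], [300.0, 400.0]],   # band 0: a reflectance band
    [[0.5, -0.1], [0.0, 0.3]],          # band 1: NDVI
])
masked = mask_low_ndvi(vals, ndvi_index=1)
print(masked)
```

Masked observations are then treated like any other gap and filled by the interpolation.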


In [None]:
# hf.close()
# hf_interp.close()