# Missing values

After the merging, we have noticed that the number of missing values were very high (around 50% for MODIS products and up to 70% for Era product. Let's see how many missing values we have for each product exactly.

## Importing datasets

In [3]:
import numpy as np
import rasterio
import matplotlib.pyplot as plt
import xarray as xr
import rioxarray as rxr
import geopandas as gpd
import harmonize as hz
import matplotlib.dates as mdates
from matplotlib.widgets import Cursor
from matplotlib import animation


In [4]:
# Open the dataset with xarray
path_data = "../data/Raw/"
ndvi = xr.open_dataset(path_data +'Raw_NDVI_16D_1km.nc')
lai = xr.open_dataset(path_data +'Raw_LAI_8D_500m.nc')
evap = xr.open_dataset(path_data +'Raw_Evap_8D_500m.nc')
era = xr.open_dataset(path_data +'Raw_weather_4H_9km.nc')
lst_night = xr.open_dataset(path_data +'Raw_LST_Night_1D_1km.nc')
lst_day = xr.open_dataset(path_data +'Raw_LST_Day_1D_1km.nc')
active_fire = xr.open_dataset(path_data +'Raw_ActiveFire_500m.nc')
burn_mask = xr.open_dataset(path_data +'Raw_BurnMask_1km.nc')
fwi = xr.open_mfdataset(path_data+'/Raw_Fwi/*.nc', combine='by_coords', chunks=None)
density = rxr.open_rasterio(path_data +'fra_pd_2015_1km_UNadj.tif').squeeze()
ndvi_large = xr.open_dataset(path_data +'Raw_NDVI_Large.nc')
ndvi_ideal = xr.open_dataset(path_data +'Raw_NDVI_Ideal.nc')

In [6]:
# Select the variables of interest
ndvi_filter = ndvi['_1_km_16_days_EVI']
lai_filter = lai['Fpar_500m']
evap_filter = evap['ET_500m']
era_filter = era[['u10', 'v10', 't2m', 'tp']]
lst_night_filter = lst_night['LST_Night_1km']
lst_day_filter = lst_day['LST_Day_1km']
# fwi_filter = fwi['fwi-daily-proj']
active_fire_filter = active_fire[['First_Day', 'Last_Day', 'Burn_Date']]
burn_mask_filter = burn_mask['FireMask']

# Import the datacube1


In [8]:
# Import the datacube
datacube1 = xr.open_dataset(path_data +'datacube1.nc')

In [11]:
datacube1

In [12]:
# Measure percentage of missing values of all variables in datacube1
print("Percentage of missing values in EVI : ", datacube1["_1_km_16_days_EVI"].isnull().sum().values /datacube1["_1_km_16_days_EVI"].size*100)
print("Percentage of missing values in LAI : ", datacube1["Fpar_500m"].isnull().sum().values /datacube1["Fpar_500m"].size*100)
print("Percentage of missing values in Evap : ", datacube1["ET_500m"].isnull().sum().values /datacube1["ET_500m"].size*100)
print("Percentage of missing values in u10 : ", datacube1["u10"].isnull().sum().values /datacube1["u10"].size*100)
print("Percentage of missing values in v10 : ", datacube1["v10"].isnull().sum().values /datacube1["v10"].size*100)
print("Percentage of missing values in t2m : ", datacube1["t2m"].isnull().sum().values /datacube1["t2m"].size*100)
print("Percentage of missing values in tp : ", datacube1["tp"].isnull().sum().values /datacube1["tp"].size*100)
print("Percentage of missing values in LST_Night : ", datacube1[""].isnull().sum().values /datacube1["LST_Night_1km"].size*100)
#print("Percentage of missing values in LST_Day : ", datacube1["LST_Day_1km"].isnull().sum().values /datacube1["LST_Day_1km"].size*100)
#print("Percentage of missing values in First_Day : ", datacube1["First_Day"].isnull().sum().values /datacube1["First_Day"].size*100)
print("Percentage of missing values in Last_Day : ", datacube1["Last_Day"].isnull().sum().values /datacube1["Last_Day"].size*100)
print("Percentage of missing values in Burn_Date : ", datacube1["Burn_Date"].isnull().sum().values /datacube1["Burn_Date"].size*100)
print("Percentage of missing values in FireMask : ", datacube1["FireMask"].isnull().sum().values /datacube1["FireMask"].size*100)





Percentage of missing values in EVI :  50.8763144013069
Percentage of missing values in LAI :  50.647266360718355
Percentage of missing values in Evap :  50.06233917818395
Percentage of missing values in u10 :  61.1471607432143
Percentage of missing values in v10 :  61.1471607432143
Percentage of missing values in t2m :  61.1471607432143
Percentage of missing values in tp :  61.1471607432143


KeyError: ''

In [7]:
# Measure percentage of missing values in all datasets
print("Percentage of missing values in NDVI dataset : ", ndvi_filter.isnull().sum().values / ndvi_filter.size * 100)
print("Percentage of missing values in LAI dataset : ", lai_filter.isnull().sum().values / lai_filter.size * 100)
print("Percentage of missing values in Evaporation dataset : ", evap_filter.isnull().sum().values / evap_filter.size * 100)
print("Percentage of missing values in ERA5 dataset : ", era_filter['t2m'].isnull().sum().values / era_filter['t2m'].size * 100)
print("Percentage of missing values in LST Night dataset : ", lst_night_filter.isnull().sum().values / lst_night_filter.size * 100)
print("Percentage of missing values in LST Day dataset : ", lst_day_filter.isnull().sum().values / lst_day_filter.size * 100)
print("Percentage of missing values in Active Fire dataset : ", active_fire_filter['Burn_Date'].isnull().sum().values / active_fire_filter['Burn_Date'].size * 100)
print("Percentage of missing values in Burn Mask dataset : ", burn_mask_filter.isnull().sum().values / burn_mask_filter.size * 100)
# print("Percentage of missing values in FWI dataset : ", fwi_filter.isnull().sum().values / fwi_filter.size * 100)


Percentage of missing values in NDVI dataset :  50.737450160571704
Percentage of missing values in LAI dataset :  50.45910641484317
Percentage of missing values in Evaporation dataset :  50.424827356685256
Percentage of missing values in ERA5 dataset :  19.973544973544975
Percentage of missing values in LST Night dataset :  73.6891316325865
Percentage of missing values in LST Day dataset :  71.74670294292552
Percentage of missing values in Active Fire dataset :  52.32569436292081
Percentage of missing values in Burn Mask dataset :  50.06101281269066


We notice that the number of missing value from the MODIS are quiet similar in the original dataset and in the final datacube. However, th number of missing value from ERA 5 is much higher in the final datacube. It went from 19% to 61%. The ERA dataset has been much more process than the MODIS ones. So we will create ERA objects for each transformation and look at the missing values when it increased the most.

In [13]:
# Get the subset of the era_filter for the variable 't2m' and the year 2018
era_filter_subset = era_filter['t2m'].sel(time='2018')

In [15]:
# Write CRS
era_filter_subset_crs = hz.define_crs(era_filter_subset, 4326)
# Define the AOI
aoi = hz.define_area_of_interest(path_data + 'AreaOfInterest.zip')
# Clip the data sets to the AOI
era_filter_subset_crs_clip = hz.clip_to_aoi(era_filter_subset, aoi)

In [16]:
era_filter_subset_crs_clip_daily = hz.resample_to_daily(era_filter_subset_crs_clip)

In [17]:
 #   Definition of the common grid
common_grid = rxr.open_rasterio(path_data + 'Raw_LST_Day_1D_1km.nc').isel(time=0)
# Create a CRS object from a poj4 string for sinuoidal projection
crs_sinu = rasterio.crs.CRS.from_string("+proj=sinu +lon_0=0 +x_0=0 +y_0=0 +a=6371007.181 +b=6371007.181 +units=m +no_defs")
# Projection of the era into sinuoidal projection
era_sinu = era_filter_subset_crs_clip_daily.rio.reproject(crs_sinu)
# Regrid the era data to the common grid
era_filter_final = hz.interpolate_to_common_grid(era_sinu, common_grid)
# Directly regrid the era data to the common grid
era_filter_final_direct = hz.interpolate_to_common_grid(era_filter_subset_crs_clip_daily, common_grid)

In [21]:
# Measure the percentage of missing value for the variable t2m in the all version of era_filter
print('Original t2m')
print('The percentage of missing value is: ', era_filter['t2m'].isnull().sum().values / era_filter['t2m'].size * 100)
print('-------------------')
print('Original t2m subset')
print('The percentage of missing value is: ', era_filter_subset.isnull().sum().values / era_filter_subset.size * 100)
print('-------------------')
print('Original t2m subset crs')
print('The percentage of missing value is: ', era_filter_subset_crs.isnull().sum().values / era_filter_subset_crs.size * 100)
print('-------------------')
print('Original t2m subset crs clip')
print('The percentage of missing value is: ', era_filter_subset_crs_clip.isnull().sum().values / era_filter_subset_crs_clip.size * 100)
print('-------------------')
print('Original t2m subset crs clip daily')
print('The percentage of missing value is: ', era_filter_subset_crs_clip_daily.isnull().sum().values / era_filter_subset_crs_clip_daily.size * 100)
print('-------------------')
print('Original t2m subset crs clip daily sinu')
print('The percentage of missing value is: ', era_sinu.isnull().sum().values / era_sinu.size * 100)
print('-------------------')
print('Original t2m subset crs clip daily sinu common grid')
print('The percentage of missing value is: ', era_filter_final.isnull().sum().values / era_filter_final.size * 100)
print('-------------------')
print('Original t2m subset crs clip daily common grid')
print('The percentage of missing value is: ', era_filter_final_direct.isnull().sum().values / era_filter_final_direct.size * 100)
print('-------------------')
print("Percentage of missing values in the final datacube (after merge): ", datacube1["u10"].sel(time='2018').isnull().sum().values /datacube1["u10"].sel(time='2018').size*100)

Original t2m
The percentage of missing value is:  19.973544973544975
-------------------
Original t2m subset
The percentage of missing value is:  19.973544973544975
-------------------
Original t2m subset crs
The percentage of missing value is:  19.973544973544975
-------------------
Original t2m subset crs clip
The percentage of missing value is:  47.878787878787875
-------------------
Original t2m subset crs clip daily
The percentage of missing value is:  48.02158572021586
-------------------
Original t2m subset crs clip daily sinu
The percentage of missing value is:  46.106925418569254
-------------------
Original t2m subset crs clip daily sinu common grid
The percentage of missing value is:  51.03890756904096
-------------------
Original t2m subset crs clip daily common grid
The percentage of missing value is:  50.90809522650144
-------------------
Percentage of missing values in the final datacube (after merge):  61.137491047032924


It's interesting that the biggest gap appeard while clipping the data to the AOI. It went from 19% to 48%. We will look at the data to see if there is a problem with the AOI. Indeed, we have been optimist while preparing our AOI which contains a lot of edges which is not recommended. We will create new AOI and test it.


We have two AOI. Both are rectangular. One is almost as the same size as the original AOI, (NDVI_large). Another is small, and does not include any sea nor mountains (NDVI_ideal).

In [None]:
ndvi_large = xr.open_dataset(path_data +'Raw_NDVI_Large.nc')
ndvi_ideal = xr.open_dataset(path_data +'Raw_NDVI_Ideal.nc')

In [22]:
# Set the time range of ndvi dataset to the same time range for as ndvi_large dataset
ndvi_subset = ndvi.sel(time=slice(ndvi_large.time[0].values, ndvi_large.time[-1].values))

In [23]:
# Subset ndvi, nedvi_large and ndvi_ideal datasets to only keep the '_1_km_16_days_EVI' variable
ndvi_subset = ndvi_subset['_1_km_16_days_EVI']
ndvi_large = ndvi_large['_1_km_16_days_EVI']
ndvi_ideal = ndvi_ideal['_1_km_16_days_EVI']

In [24]:
# Measure percentage of missing values
print(' NDVI: ', ndvi_subset.isnull().sum().values / ndvi_subset.size * 100, '%')
print(' NDVI Large: ', ndvi_large.isnull().sum().values / ndvi_large.size * 100, '%')
print(' NDVI Ideal: ', ndvi_ideal.isnull().sum().values / ndvi_ideal.size * 100, '%')

 NDVI:  50.710469501746545 %
 NDVI Large:  7.627092928265254 %
 NDVI Ideal:  5.747248199134796 %


Our hypothesis was right. By just clipping the data to the AOI, we lost a lot of data because it was really edgy. The difference induced by sea and mountains is not that big. We will use a new area of interest , the large one here and rebuild our datacube.