# IMERG data preprocessing

**Data Access**: https://disc.gsfc.nasa.gov/datasets/GPM_3IMERGDF_07/summary

-----------------

## GPM IMERG Final Precipitation L3 1 day 0.1 degree x 0.1 degree V07

NASA/GSFC/SED/ESD/GCDC/GESDISC

Version 07 is the current version of the data set. Older versions have been superseded by Version 07 and will no longer be available. The Integrated Multi-satellitE Retrievals for GPM (IMERG) is a NASA product that estimates global surface precipitation rates at a high resolution of 0.1° every half hour, beginning in 2000.

It is part of the joint NASA-JAXA Global Precipitation Measurement (GPM) mission, using the GPM Core Observatory satellite as the standard to combine precipitation observations from an international constellation of satellites using advanced techniques. 

IMERG can be used for global-scale applications as well as over regions with sparse or no reliable surface observations. The fine spatial and temporal resolution of IMERG data allows them to be accumulated to the scale of the application for increased skill. 

IMERG has three Runs with varying latencies in response to a range of application needs: rapid-response applications (Early Run, 4-h latency), same/next-day applications (Late Run, 14-h latency), and post-real-time research (Final Run, 3.5-month latency). 

While IMERG strives for consistency and accuracy, satellite estimates of precipitation are expected to have lower skill over frozen surfaces, complex terrain, and coastal zones. In addition, changes in the GPM satellite constellation over time may introduce artifacts that affect studies focusing on multi-year changes.

This dataset is the GPM Level 3 IMERG *Final* Daily 10 x 10 km product (GPM_3IMERGDF), derived from the half-hourly GPM_3IMERGHH. The derived result represents the Final estimate of the daily mean precipitation rate in mm/day. It is produced by first computing the mean precipitation rate (in mm/hour) in every grid cell and then multiplying the result by 24.

This minimizes the dry bias that could appear in the simple daily totals of versions before "07" for cells where fewer than 48 half-hourly observations are valid for the day. Such under-sampling is very rare in the combined microwave-infrared and rain gauge variable "precipitation", and it appears at higher latitudes, so in most cases users of global "precipitation" data will not notice any difference. The correction is noticeable, however, in the high-quality microwave retrieval variable "MWprecipitation", where having fewer than 48 valid half-hourly samples per day is very common.

The counts of valid half-hourly samples per day have always been provided as a separate variable, and users of daily data were advised to pay close attention to that variable and use it to calculate the correct daily precipitation rates. Starting with version "07", this is done in production to minimize possible misinterpretation of the data. The counts are still provided, but only so that users can gauge the significance of the daily rates and reconstruct the simple totals if they wish.
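The arithmetic behind this correction can be illustrated with a small sketch. The half-hourly values below are made up, and the formula for the "simple total" (sum of valid rates times 0.5 hour) is an assumption about how the pre-07 totals were formed, not something taken from the product documentation.

```python
import numpy as np

# Hypothetical half-hourly rates (mm/hour) for one grid cell and one day.
# NaN marks a missing half-hourly sample; here 12 of the 48 are missing.
rates = np.full(48, 2.0)
rates[:12] = np.nan

# Number of valid half-hourly samples (the role played by the count variables).
valid_count = int(np.sum(~np.isnan(rates)))

# Assumed "simple total": each valid sample contributes rate * 0.5 hour.
# Missing samples contribute nothing, which is the source of the dry bias.
simple_total = float(np.nansum(rates) * 0.5)

# Mean rate over the valid samples, scaled to a full day (mm/day).
daily_rate = float(np.nanmean(rates) * 24.0)

# Under the assumption above, the simple total can be recovered
# from the daily rate and the valid-sample count.
reconstructed_total = daily_rate * valid_count / 48.0

print(valid_count, simple_total, daily_rate, reconstructed_total)
```

With a constant 2 mm/hour rate and a quarter of the samples missing, the simple total falls short of the scaled daily rate by the same quarter, which is the kind of dry bias the description refers to.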

In [1]:
import os
import sys
import yaml
import dask
import zarr
import numpy as np
import xarray as xr
import pandas as pd
from glob import glob

import calendar
from datetime import datetime, timedelta
from dateutil.relativedelta import relativedelta

sys.path.insert(0, os.path.realpath('../libs/'))
import verif_utils as vu

In [2]:
base_dir = '/glade/campaign/cisl/aiml/ksha/IMERG_V7/daily/'
output_dir = '/glade/campaign/cisl/aiml/ksha/IMERG_V7/daily/gather_yearly/'

In [3]:
files = sorted(glob(base_dir+'*.nc4'))

In [4]:
varname_drop = [
    'precipitation_cnt', 'precipitation_cnt_cond', 'MWprecipitation', 'MWprecipitation_cnt', 'time_bnds',
    'MWprecipitation_cnt_cond', 'randomError', 'randomError_cnt', 'probabilityLiquidPrecipitation'
]

dict_rename = {
    'lon': 'longitude',
    'lat': 'latitude'
}

In [5]:
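for year in range(2000, 2010):
    
    print(f'Processing year {year}')
    ds_collection = []
    # collect the daily files of this year, sorted by date
    files = sorted(glob(base_dir+f'3B-DAY.MS.MRG.3IMERG.{year}*.nc4'))
    # coarse completeness check: at least 365 daily files per year
    assert len(files) >= 365, f'year {year} has missing files'
    
    for fn in files:
        ds = xr.open_dataset(fn)
        # keep only the variables that are needed and use full coordinate names
        ds = ds.drop_vars(varname_drop)
        ds = ds.rename(dict_rename)
        ds['latitude'] = ds['latitude'].astype('float32')
        ds['longitude'] = ds['longitude'].astype('float32')
        ds.attrs = {} # clear attributes
        ds_collection.append(ds)
    
    ds_all = xr.concat(ds_collection, dim='time')
    # IMERG labels each day with its start time (00:00);
    # shift by one day so that the label marks the end of the accumulation period
    ds_all = ds_all.assign_coords(time=ds_all['time'] + pd.Timedelta(days=1))
    
    save_names = output_dir + f'year_{year}.zarr'
    # the write is disabled here; uncomment the next line to save the yearly file
    # ds_all.to_zarr(save_names, mode='w')
    print(f'Save to {save_names}')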
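The `>= 365` assertion in the loop is a coarse check: it cannot say which day is missing, and it still passes for a leap year with one file absent. A minimal sketch of a stricter check follows. The helper `missing_days` is hypothetical, and the part of the file name after the date (`-S000000-E235959.V07B.nc4`) is an assumption used only for illustration; the helper relies only on the eight-digit date that follows `3IMERG.`.

```python
import re
from datetime import date, timedelta

def missing_days(filenames, year):
    """Return the dates in `year` that have no matching file name (hypothetical helper)."""
    # Assumes the date appears as YYYYMMDD right after '3IMERG.' in each name.
    pattern = re.compile(r'3IMERG\.(\d{8})')
    found = set()
    for name in filenames:
        match = pattern.search(name)
        if match:
            stamp = match.group(1)
            found.add(date(int(stamp[:4]), int(stamp[4:6]), int(stamp[6:8])))
    # Walk through every calendar day of the year, leap days included.
    missing = []
    day = date(year, 1, 1)
    while day.year == year:
        if day not in found:
            missing.append(day)
        day += timedelta(days=1)
    return missing

# Made-up file names for illustration: 2 January is absent.
names = [
    '3B-DAY.MS.MRG.3IMERG.20090101-S000000-E235959.V07B.nc4',
    '3B-DAY.MS.MRG.3IMERG.20090103-S000000-E235959.V07B.nc4',
]
gaps = missing_days(names, 2009)
print(gaps[0], len(gaps))
```

In the loop, a call such as `missing_days(files, year)` could replace or accompany the assertion, so that an incomplete year reports the exact dates that need to be downloaded again.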

## Note: IMERG `time` marks the start of the day, NOT the ending time

In [10]:
# fn is the last file opened in the loop above (2009-12-31)
ds = xr.open_dataset(fn)

In [13]:
ds['time_bnds'].values

array([['2009-12-31T00:00:00.000000000', '2009-12-31T23:59:59.000026752']],
      dtype='datetime64[ns]')

In [15]:
ds['time'].values

array(['2009-12-31T00:00:00.000000000'], dtype='datetime64[ns]')
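The outputs above show that `time` equals the lower bound of `time_bnds`, which is why the yearly loop shifts the coordinate by one day. A small sketch with plain NumPy datetimes, using the values printed above truncated to whole seconds, shows the effect of that shift:

```python
import numpy as np

# Values copied from the outputs above, truncated to whole seconds.
start = np.datetime64('2009-12-31T00:00:00')       # time_bnds lower bound
end = np.datetime64('2009-12-31T23:59:59')         # time_bnds upper bound
time_label = np.datetime64('2009-12-31T00:00:00')  # the 'time' coordinate

# The label coincides with the start of the accumulation period.
label_is_start = bool(time_label == start)

# Same shift as in the yearly loop: move the label forward by one day,
# so that it marks the end of the accumulation period instead.
shifted = time_label + np.timedelta64(1, 'D')

print(label_is_start, shifted, shifted - end)
```

After the shift, the label falls one second after the upper bound, so the value labelled 2010-01-01T00:00 describes precipitation accumulated during 2009-12-31.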