<a href="https://colab.research.google.com/github/Sugirlstar/GPU_griddedCalcualtion_draft/blob/main/ERA5Land_precipitation_hourly2daily.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# Data preprocessing (time merging, crop, etc.) for ERA5-Land hourly data

This notebook preprocesses ERA5-Land hourly data, using China as the example study region and 1970-2022 as the example study period.

*   Data download Link: https://cds.climate.copernicus.eu/cdsapp#!/dataset/reanalysis-era5-land?tab=overview
*   Data document: https://confluence.ecmwf.int/display/CKB/ERA5-Land%3A+data+documentation
---
## There are two main types of data in ERA5-Land:
1. Instantaneous

For this type, the daily value is the mean of all time steps within a day (00:00-23:00).

2. Accumulations

For this type, the daily value is the first value of the next day (the total for day i is stored at 00:00 of day i+1). Reference:

> Please, note that the convention for accumulations used in ERA5-Land differs with that for ERA5. The accumulations in the short forecasts of ERA5-Land (with hourly steps from 01 to 24) are treated the same as those in ERA-Interim or ERA-Interim/Land, i.e., they are accumulated from the beginning of the forecast to the end of the forecast step. For example, runoff at day=D, step=12 will provide runoff accumulated from day=D, time=0 to day=D, time=12. The maximum accumulation is over 24 hours, i.e., from day=D, time=0 to day=D+1,time=0 (step=24).
>
> *   HRES: accumulations are from 00 UTC to the hour ending at the forecast step
> *   For the CDS time, or validity time, of 00 UTC, the accumulations are over the 24 hours ending at 00 UTC i.e. the accumulation is during the previous day
> *   Synoptic monthly means (stream=mnth): accumulations have units of "variable_units per forecast_step hours"
> *   Monthly means of daily means (stream=moda): accumulations have units that include "per day", see section Monthly means

Source: https://confluence.ecmwf.int/display/CKB/ERA5-Land%3A+data+documentation#heading-Accumulations
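
The two conventions above can be illustrated on synthetic data. The following is a minimal sketch using pandas instead of real ERA5-Land files: the series `t2m` and `hourly_amount` hold made-up values, and all variable names are illustrative assumptions rather than ERA5-Land variable names.

```python
import numpy as np
import pandas as pd

one_hour = pd.Timedelta(hours=1)

# --- Instantaneous variable: daily value = mean of 00:00-23:00 ---
times = pd.date_range("2022-01-01 00:00", periods=48, freq=one_hour)
t2m = pd.Series(np.arange(48, dtype=float), index=times)  # synthetic values
daily_mean = t2m.resample("D").mean()

# --- Accumulated variable: daily value = 00:00 value of the next day ---
# Each stamp marks the END of an hour: 01:00 on day 1 ... 00:00 on day 3.
acc_times = times + one_hour
hourly_amount = pd.Series(np.arange(1, 49, dtype=float), index=acc_times)

# The forecast day of a stamp is the date of the hour that just ended.
forecast_day = (acc_times - one_hour).normalize()

# Accumulate from the start of each forecast day (steps 01 to 24).
accumulated = hourly_amount.groupby(forecast_day).cumsum()

# The 00:00 value (step 24) is the full total of the previous day,
# so select it and move its label back by one day.
midnight = accumulated[accumulated.index.hour == 0]
daily_total = midnight.copy()
daily_total.index = midnight.index - pd.Timedelta(days=1)

print(daily_mean)
print(daily_total)
```

In this sketch the 00:00 value on 2022-01-02 equals the sum of the 24 hourly amounts of 2022-01-01, which is why relabeling it to the previous day gives the daily total.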


# Step 0 Download the data
Suppose we want to study the period 1970-2022 and have downloaded the data for each year from 1970 to 2022. Each nc file contains one year of data and is named with the year at the end (e.g. total_precipitation_2022.nc).
* **Important note**: We must also download the first time step of 2023 (one day after the intended period), because the total for day i is stored at 00:00 of day i+1. So we also have a file named "total_precipitation_2023.nc", which only contains the 00:00 data for 2023-01-01.
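
To make the note above concrete, the sketch below lists which 00:00 time stamps are needed to cover a study period. It uses only the standard library, and the helper name `required_midnight_stamps` is hypothetical, introduced here for illustration.

```python
from datetime import date, datetime, timedelta


def required_midnight_stamps(first_day, last_day):
    """Return the 00:00 stamps that hold the daily totals of first_day..last_day.

    The total of day i is stored at 00:00 of day i+1, so every stamp is
    one day later than the day it describes.
    """
    n_days = (last_day - first_day).days + 1
    start = datetime(first_day.year, first_day.month, first_day.day)
    return [start + timedelta(days=i + 1) for i in range(n_days)]


stamps = required_midnight_stamps(date(1970, 1, 1), date(2022, 12, 31))
print(stamps[0], stamps[-1], len(stamps))
# 1970-01-02 00:00:00 2023-01-01 00:00:00 19358
```

The last required stamp falls in 2023, which is why the extra file is needed. The 00:00 stamp of 1970-01-01 is not in the list, since it describes 1969-12-31.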

# Step 1 Convert each hourly file to a daily file (assuming each file contains one year of data)

In [None]:
import os

import pandas as pd
import xarray as xr

# Set the paths.
# Put all the hourly nc files into the input directory in advance,
# and remove any unrelated files from it.
directory = 'H:/total_precipitation/'
out_dir = 'D:/hyj/ERA5Land_tp_daily/'
os.makedirs(out_dir, exist_ok=True)

# Get all files in the directory that end in .nc, in sorted order
files = sorted(f for f in os.listdir(directory) if f.endswith('.nc'))

for file in files:
    # Read the total precipitation ('tp') for each year
    data = xr.open_dataset(os.path.join(directory, file))['tp']
    # Remove the file suffix ".nc" to get a plain file name
    pure_filename = file.rsplit('.', 1)[0]

    # Get the year from the last four characters of the file name
    year = pure_filename[-4:]

    # Extract the 00:00 value of each day (the accumulated total of the previous day)
    daily = data.sel(time=data.time.dt.hour == 0)

    # Relabel each value with the day it actually describes (the previous day)
    daily = daily.assign_coords(time=daily.time - pd.Timedelta(days=1))

    # Write to a netCDF file.
    # Note: because of the one-day shift, daily_{year}.nc starts at Dec 31 of
    # the previous year, and a full-year file ends at Dec 30 of {year}.
    daily.to_netcdf(os.path.join(out_dir, f'daily_{year}.nc'))

    # Close the source file and release the RAM
    data.close()
    del data, daily
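
The effect of the one-day relabeling on a single yearly file can be checked without any data. The sketch below applies the same two operations (select the 00:00 stamps, subtract one day) to a synthetic hourly time axis for 2022; the helper name `daily_labels_from_hourly` is hypothetical.

```python
import pandas as pd


def daily_labels_from_hourly(hourly_times):
    """Select the 00:00 stamps and relabel each one to the previous day."""
    midnight = hourly_times[hourly_times.hour == 0]
    return midnight - pd.Timedelta(days=1)


# Synthetic hourly axis covering one calendar year (no real data involved)
hourly_2022 = pd.date_range(
    "2022-01-01 00:00", "2022-12-31 23:00", freq=pd.Timedelta(hours=1)
)
labels = daily_labels_from_hourly(hourly_2022)
print(labels[0].date(), labels[-1].date(), len(labels))
# 2021-12-31 2022-12-30 365
```

Each yearly output therefore starts one day before its own calendar year and stops one day short of its end, so a complete year is only available once the files of adjacent years are combined.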


# Step 2 Merge all daily files into one multi-year file

In [None]:
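import os

import xarray as xr

directory = 'D:/hyj/ERA5Land_tp_daily/'

# Collect the daily files written in Step 1, in sorted (chronological) order
files = sorted(f for f in os.listdir(directory) if f.endswith('.nc'))

datasets = []
for file in files:
    ds = xr.open_dataset(os.path.join(directory, file))
    datasets.append(ds)

# Concatenate along time and make sure the time axis is increasing
combined = xr.concat(datasets, dim='time').sortby('time')

# Keep only the study period; this drops 1969-12-31, which comes from
# the 00:00 time step of 1970-01-01
combined = combined.sel(time=slice('1970-01-01', '2022-12-31'))

combined.to_netcdf('ERA5Land_precipitation_1970_2022_China_daily.nc')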
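
After merging, it can be worth confirming that the time axis holds exactly one entry per day of the study period. The following is a minimal sketch of such a check on a pandas `DatetimeIndex`; the function name `check_daily_axis` is hypothetical, and the example axis is synthetic, with one stray day and one gap inserted on purpose.

```python
import pandas as pd


def check_daily_axis(time_index, first_day, last_day):
    """Compare a time axis with a complete daily axis for the given period.

    Returns three DatetimeIndex objects: days that are missing, days that
    lie outside the period, and days that occur more than once.
    """
    expected = pd.date_range(first_day, last_day, freq="D")
    missing = expected.difference(time_index)
    extra = time_index.difference(expected)
    duplicated = time_index[time_index.duplicated()]
    return missing, extra, duplicated


# Synthetic axis: starts one day too early and lacks 2000-02-29
axis = pd.date_range("1969-12-31", "2022-12-31", freq="D").drop(
    pd.Timestamp("2000-02-29")
)
missing, extra, duplicated = check_daily_axis(axis, "1970-01-01", "2022-12-31")
print(list(missing.date), list(extra.date), len(duplicated))
```

With xarray, the index to pass in would typically be available as `combined.indexes['time']`. All three results should be empty for a complete 1970-2022 daily file.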
