# Data preparation
## era5cli 
- install era5cli: `conda install -c conda-forge era5cli` (env: `geo-data`)
- install the analysis packages: `conda install -c conda-forge xarray netCDF4 scipy bottleneck matplotlib cartopy seaborn nc-time-axis`
- register with Copernicus/ERA5 to get a userid and passkey
- run `era5cli config` to store the userid and passkey
- download command: `era5cli monthly --variables 2m_temperature --startyear 1940 --endyear 2024 --ensemble --threads 3 --overwrite`
  - CLI parameter reference: https://era5cli.readthedocs.io/en/stable/reference/arguments/
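
The download command above can also be assembled programmatically, for example to re-run it for a different year range. The following is a minimal sketch: the helper name `build_era5cli_command` is hypothetical, and it uses only the flags that appear in the command above. It builds and prints the command but does not execute it.

```python
import shlex


def build_era5cli_command(variable, start_year, end_year, threads=3):
    """Return the era5cli call as an argument list (hypothetical helper)."""
    if start_year > end_year:
        raise ValueError("start_year must not be after end_year")
    return [
        "era5cli", "monthly",
        "--variables", variable,
        "--startyear", str(start_year),
        "--endyear", str(end_year),
        "--ensemble",
        "--threads", str(threads),
        "--overwrite",
    ]


# Reproduce the command from the list above as a single shell-safe string.
cmd = build_era5cli_command("2m_temperature", 1940, 2024)
print(shlex.join(cmd))
```

To actually start the download, the list could be passed to `subprocess.run(cmd, check=True)` from within the `geo-data` environment.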

# Raw data processing
- run this section only when the data needs to be extended or updated
- the raw data is kept in a folder outside of the repository, i.e. `/graphs-for-sm/../data/copernicus/monthly/2m-temperature`
- the netCDF4 `.nc` files are processed one by one, which might take a while
- the results are `.tsv.zip` files in the repository data folder `/graphs-for-sm/data/copernicus/monthly/2m-temperature/`
- delete the original `.nc` files after processing has finished to free up storage space
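
The cleanup step can be made safer by deleting only those `.nc` files that already have a non-empty `.tsv.zip` counterpart. This is a sketch, not part of the original notebook: the function names `removable_nc_files` and `delete_processed` are hypothetical, and it assumes the output file is named after the `.nc` file's stem, as in the processing cell further down.

```python
from pathlib import Path


def removable_nc_files(raw_dir, out_dir):
    """List .nc files whose .tsv.zip counterpart exists and is non-empty."""
    removable = []
    for nc_path in sorted(Path(raw_dir).glob("*.nc")):
        tsv_path = Path(out_dir) / (nc_path.stem + ".tsv.zip")
        if tsv_path.is_file() and tsv_path.stat().st_size > 0:
            removable.append(nc_path)
    return removable


def delete_processed(raw_dir, out_dir, dry_run=True):
    """Delete processed .nc files; with dry_run=True only report them."""
    targets = removable_nc_files(raw_dir, out_dir)
    for nc_path in targets:
        if dry_run:
            print(f"would delete {nc_path.name}")
        else:
            nc_path.unlink()
    return targets
```

Calling `delete_processed(data_folder, local_data_folder)` first lists the candidates; passing `dry_run=False` removes them.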


In [1]:
from pathlib import Path

In [2]:
cwd = Path.cwd()
str(cwd)

'c:\\Dropbox\\CODE2\\data-analysis\\github\\graphs-for-sm\\graphs-for-sm\\notebooks'

In [42]:
data_folder = cwd / "../../data/copernicus/monthly/2m-temperature"
if not data_folder.is_dir():
    raise RuntimeError(
        "data folder not found: "
        "the data must be downloaded and in an accessible folder"
    )

In [43]:
local_data_folder = cwd / "../data/copernicus/monthly/2m-temperature"
# create the folder, including missing parents; no error if it already exists
local_data_folder.mkdir(parents=True, exist_ok=True)

In [52]:
# sort so the files are processed in a deterministic (chronological) order
all_nc_files = sorted(data_folder.glob("*.nc"))

In [7]:
import xarray as xa

In [53]:
if all_nc_files:
    for p in all_nc_files:
        print(f"processing {p.name}")
        print("opening file as xarray Dataset")
        # the context manager closes the file once the DataFrame has been built
        with xa.open_dataset(p) as ds:
            print("creating pandas DataFrame")
            df = ds.to_dataframe()
        print("transforming temperature from Kelvin to Celsius")
        df["t2m"] = df["t2m"] - 273.15
        print("calculating mean for ensemble (reducing values to one value per grid location)")
        df = df.groupby(level=["longitude", "latitude", "time"]).mean()
        tsv_filename = p.stem + ".tsv.zip"
        print(f"saving DataFrame locally as zipped tsv: {tsv_filename}")
        # pandas infers zip compression from the ".zip" suffix
        df.to_csv(local_data_folder / tsv_filename, sep="\t")
        print("cleaning up references")
        del ds, df
        print(f"{p.name} processed successfully")
        print("-----------------------------")
else:
    print("No .nc files to process!")

processing era5_2m_temperature_1941_monthly_ensemble.nc
opening file as xarray Dataset
creating pandas DataFrame
transforming temperature from Kelvin to Celsius
calculating mean for ensemble (reducing values to one value per grid location)
saving DataFrame locally as zipped tsv: era5_2m_temperature_1941_monthly_ensemble.tsv.zip
cleaning up references
era5_2m_temperature_1941_monthly_ensemble.nc processed successfully
-----------------------------
processing era5_2m_temperature_1942_monthly_ensemble.nc
opening file as xarray Dataset
creating pandas DataFrame
transforming temperature from Kelvin to Celsius
calculating mean for ensemble (reducing values to one value per grid location)
saving DataFrame locally as zipped tsv: era5_2m_temperature_1942_monthly_ensemble.tsv.zip
cleaning up references
era5_2m_temperature_1942_monthly_ensemble.nc processed successfully
-----------------------------
processing era5_2m_temperature_1943_monthly_ensemble.nc
opening file as xarray Dataset
creating pa
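
To make the conversion and the ensemble reduction in the processing cell easier to follow, here is a small self-contained illustration on synthetic data. It is a sketch under assumptions: the values are made up, the helper name `ensemble_mean_celsius` is hypothetical, and the index level name `number` for the ensemble member is assumed rather than taken from the downloaded files.

```python
import pandas as pd

# Synthetic stand-in for ds.to_dataframe(): two ensemble members,
# one time step and one grid point. The level name "number" is an assumption.
index = pd.MultiIndex.from_product(
    [[0, 1], pd.to_datetime(["1940-01-01"]), [-90.0], [0.0]],
    names=["number", "time", "latitude", "longitude"],
)
raw = pd.DataFrame({"t2m": [243.15, 245.15]}, index=index)  # Kelvin


def ensemble_mean_celsius(df):
    """Convert t2m from Kelvin to Celsius and average over ensemble members."""
    out = df.copy()
    out["t2m"] = out["t2m"] - 273.15
    # Grouping by every level except the ensemble member collapses the members
    # into one mean value per (longitude, latitude, time).
    return out.groupby(level=["longitude", "latitude", "time"]).mean()


result = ensemble_mean_celsius(raw)
print(result)
```

The two members (-30 °C and -28 °C) collapse into a single row with their mean for the one grid point and time step.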

# Validation
- check that the processed data was stored and can be loaded into a pandas DataFrame

In [44]:
import pandas as pd

In [48]:
test_year = 1940
test_filename = f"era5_2m_temperature_{test_year}_monthly_ensemble.tsv.zip"
df = pd.read_csv(
    local_data_folder / test_filename,
    sep="\t",
    index_col=["longitude", "latitude", "time"],
)
df.head()

longitude,latitude,time,t2m
0.0,-90.0,1940-01-01,-28.830702
0.0,-90.0,1940-02-01,-37.942286
0.0,-90.0,1940-03-01,-46.964342
0.0,-90.0,1940-04-01,-51.481642
0.0,-90.0,1940-05-01,-50.214846


In [49]:
del df
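
As a lighter-weight complement to loading a file into pandas, each archive can be checked for integrity and for the expected header using only the standard library. This is a sketch with hypothetical function names (`tsv_zip_header`, `has_expected_columns`); it assumes each `.tsv.zip` holds a single tab-separated member whose first line is the header, and it reads whichever member comes first rather than assuming a member name.

```python
import zipfile


def tsv_zip_header(path):
    """Return the column names from the first line of a zipped TSV file."""
    with zipfile.ZipFile(path) as archive:
        # testzip() runs a CRC check; None means all members are intact
        bad_member = archive.testzip()
        if bad_member is not None:
            raise ValueError(f"corrupt member in {path}: {bad_member}")
        first_member = archive.namelist()[0]
        with archive.open(first_member) as handle:
            first_line = handle.readline().decode("utf-8")
    return first_line.rstrip("\r\n").split("\t")


def has_expected_columns(path, expected=("longitude", "latitude", "time", "t2m")):
    """Check that every expected column name appears in the header."""
    header = tsv_zip_header(path)
    return all(name in header for name in expected)
```

Looping over `local_data_folder.glob("*.tsv.zip")` and calling `has_expected_columns` on each path gives a quick pass over all years without reading the full tables.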