# Xarray Dataset conversions and exports

In [1]:
# initialization
import numpy as np
import pandas as pd
import xarray as xr

## From pandas DataFrame to xarray Dataset

Occasionally, you may need to convert a pandas DataFrame into an xarray Dataset and vice versa. To give a concrete example, consider the CalCOFI data we examined in week 6. One may argue that the data is grid-like when you consider time and depth as coordinates, and there may be an advantage in turning it into an xarray Dataset.

To see how this works, we load a version of the CalCOFI subset in which the depth is binned (you can download a copy of the file [here](https://github.com/OCEAN-215-2025/preclass/tree/main/week_07/data/CalCOFI_binned.csv)):

In [2]:
CalCOFI2 = pd.read_csv("data/CalCOFI_binned.csv", parse_dates = ["Datetime"])
display(CalCOFI2)

Unnamed: 0,Cast_Count,Station_ID,Datetime,Depth_bin_m,T_degC,Salinity,SigmaTheta
0,992,090.0 070.0,1950-02-06 19:54:00,5,14.040,33.1700,24.76600
1,992,090.0 070.0,1950-02-06 19:54:00,15,13.950,33.2100,24.81500
2,992,090.0 070.0,1950-02-06 19:54:00,25,13.900,33.2100,24.82600
3,992,090.0 070.0,1950-02-06 19:54:00,35,13.810,33.2180,24.85100
4,992,090.0 070.0,1950-02-06 19:54:00,55,13.250,33.1500,24.91200
...,...,...,...,...,...,...,...
6402,35578,090.0 070.0,2021-01-21 13:36:00,205,8.518,34.0402,26.44858
6403,35578,090.0 070.0,2021-01-21 13:36:00,255,8.104,34.1405,26.59119
6404,35578,090.0 070.0,2021-01-21 13:36:00,275,8.012,34.1498,26.61270
6405,35578,090.0 070.0,2021-01-21 13:36:00,305,7.692,34.1712,26.67697


To convert the pandas DataFrame to an xarray Dataset, we need to tell xarray which columns ("variables") are coordinates. In our case these columns are `Datetime` and `Depth_bin_m`, and we need to turn them into row indices, which is done with the `.set_index()` method of the DataFrame. Moreover, let's say we want to retain only `T_degC`, `Salinity`, and `SigmaTheta` as our data variables, so we select them using the `.loc[]` indexer *after* setting the index:

In [3]:
CalCOFI3 = CalCOFI2.set_index(["Datetime", "Depth_bin_m"]).loc[:, ["T_degC", "Salinity", "SigmaTheta"]]
display(CalCOFI3)

Datetime,Depth_bin_m,T_degC,Salinity,SigmaTheta
1950-02-06 19:54:00,5,14.040,33.1700,24.76600
1950-02-06 19:54:00,15,13.950,33.2100,24.81500
1950-02-06 19:54:00,25,13.900,33.2100,24.82600
1950-02-06 19:54:00,35,13.810,33.2180,24.85100
1950-02-06 19:54:00,55,13.250,33.1500,24.91200
...,...,...,...,...
2021-01-21 13:36:00,205,8.518,34.0402,26.44858
2021-01-21 13:36:00,255,8.104,34.1405,26.59119
2021-01-21 13:36:00,275,8.012,34.1498,26.61270
2021-01-21 13:36:00,305,7.692,34.1712,26.67697
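
Before converting, it can be worth checking that every combination of index values occurs only once, since xarray expects a MultiIndex without duplicates when it builds the grid. The sketch below uses a small made-up table (the `toy` DataFrame and its values are hypothetical, not CalCOFI data) to show the same `.set_index()` and `.loc[]` steps followed by such a check:

```python
import pandas as pd

# Hypothetical toy table: two casts with two depth bins each
toy = pd.DataFrame({
    "Datetime": pd.to_datetime(
        ["2020-01-01", "2020-01-01", "2020-02-01", "2020-02-01"]
    ),
    "Depth_bin_m": [5, 15, 5, 15],
    "T_degC": [14.0, 13.5, 13.8, 13.1],
    "Station_ID": ["A", "A", "A", "A"],
})

# Same pattern as above: set the index first, then keep the data columns
toy_indexed = toy.set_index(["Datetime", "Depth_bin_m"]).loc[:, ["T_degC"]]

# Each (Datetime, Depth_bin_m) pair should appear only once
index_is_unique = toy_indexed.index.is_unique
n_duplicates = int(toy_indexed.index.duplicated().sum())
print(index_is_unique, n_duplicates)
```

If duplicates do show up, one option is to aggregate them first (for example with `.groupby(level=[0, 1]).mean()`), although whether averaging is appropriate depends on the data.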


We are now ready to convert this DataFrame into an xarray Dataset, and all it takes is an `xr.Dataset.from_dataframe()` call:

In [4]:
CalCOFI_xr = xr.Dataset.from_dataframe(CalCOFI3)
display(CalCOFI_xr)

Note that xarray automatically "completes" the grid and fills in missing values for us. For example, there is no measurement near 385 m on 1950-02-06. This value is *implicitly missing* in the pandas DataFrame (the row is simply absent), but becomes *explicitly missing* (`NaN`) in the xarray Dataset, since measurements near 385 m were made on other dates:

In [5]:
CalCOFI2.loc[(CalCOFI2["Datetime"] == pd.to_datetime("1950-02-06 19:54")) & (CalCOFI2["Depth_bin_m"] == 385)]

Unnamed: 0,Cast_Count,Station_ID,Datetime,Depth_bin_m,T_degC,Salinity,SigmaTheta


In [6]:
CalCOFI_xr.sel(Datetime=pd.to_datetime("1950-02-06 19:54"), Depth_bin_m = 385)
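
The same effect can be seen on a much smaller scale. The sketch below uses a made-up three-row table (the `toy` DataFrame is hypothetical) in which one time and depth combination is absent, and shows that the resulting Dataset holds a full 2 x 2 grid with an explicit `NaN` in the gap:

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical toy data: the (2020-02-01, 15 m) combination is absent
toy = pd.DataFrame({
    "Datetime": pd.to_datetime(["2020-01-01", "2020-01-01", "2020-02-01"]),
    "Depth_bin_m": [5, 15, 5],
    "T_degC": [14.0, 13.5, 13.8],
})

toy_xr = xr.Dataset.from_dataframe(toy.set_index(["Datetime", "Depth_bin_m"]))

# Three rows in the table, but a full grid in the Dataset
grid_sizes = dict(toy_xr.sizes)
print(grid_sizes)

# The absent combination is now an explicit missing value
filled = toy_xr["T_degC"].sel(
    Datetime=pd.Timestamp("2020-02-01"), Depth_bin_m=15
).item()
n_missing = int(toy_xr["T_degC"].isnull().sum())
print(filled, n_missing)
```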

## From xarray Dataset to pandas DataFrame

For the converse (from xarray to pandas), consider tide gauge measurements near Key West, FL, courtesy of the [University of Hawaii Sea Level Center](https://uhslc.soest.hawaii.edu/datainfo/) (you can download a copy of the netCDF file [here](https://github.com/OCEAN-215-2025/preclass/tree/main/week_07/data/tide_gauges.nc)):

In [7]:
gauge_xr = xr.open_dataset("data/tide_gauges.nc")
display(gauge_xr)

Observe that the `record_id` dimension has only 4 coordinate values, which largely identify the stations. Thus, the data is essentially 1D, and it makes sense to present it in tabular form.

To convert the xarray Dataset to a pandas DataFrame, all you need is to call the `.to_dataframe()` method of the Dataset:

In [8]:
gauge_pd = gauge_xr.to_dataframe()
display(gauge_pd)

time,record_id,sea_level,lat,lon,station_name,station_country
1913-01-19 06:00:00.000000,2570,,26.690001,281.016998,Settlement Point,Bahamas (the)
1913-01-19 06:00:00.000000,7550,,25.732000,279.838013,"Virginia Key, FL",United States of America (the)
1913-01-19 06:00:00.000000,7620,,30.403000,272.786987,"Pensacola, FL",United States of America (the)
1913-01-19 06:00:00.000000,2420,1128.0,24.552999,278.191986,"Key West, FL",United States of America (the)
1913-01-19 07:00:00.028800,2570,,26.690001,281.016998,Settlement Point,Bahamas (the)
...,...,...,...,...,...,...
2024-09-30 22:00:00.028800,2420,1893.0,24.552999,278.191986,"Key West, FL",United States of America (the)
2024-09-30 22:59:59.971200,2570,1640.0,26.690001,281.016998,Settlement Point,Bahamas (the)
2024-09-30 22:59:59.971200,7550,3894.0,25.732000,279.838013,"Virginia Key, FL",United States of America (the)
2024-09-30 22:59:59.971200,7620,2965.0,30.403000,272.786987,"Pensacola, FL",United States of America (the)


Notice that the coordinates (`time` and `record_id`) of the Dataset have become indices (row labels) of the DataFrame. To convert the indices to regular columns, we apply the `.reset_index()` method:

In [9]:
gauge_pd = gauge_pd.reset_index()
display(gauge_pd)

Unnamed: 0,time,record_id,sea_level,lat,lon,station_name,station_country
0,1913-01-19 06:00:00.000000,2570,,26.690001,281.016998,Settlement Point,Bahamas (the)
1,1913-01-19 06:00:00.000000,7550,,25.732000,279.838013,"Virginia Key, FL",United States of America (the)
2,1913-01-19 06:00:00.000000,7620,,30.403000,272.786987,"Pensacola, FL",United States of America (the)
3,1913-01-19 06:00:00.000000,2420,1128.0,24.552999,278.191986,"Key West, FL",United States of America (the)
4,1913-01-19 07:00:00.028800,2570,,26.690001,281.016998,Settlement Point,Bahamas (the)
...,...,...,...,...,...,...,...
3916579,2024-09-30 22:00:00.028800,2420,1893.0,24.552999,278.191986,"Key West, FL",United States of America (the)
3916580,2024-09-30 22:59:59.971200,2570,1640.0,26.690001,281.016998,Settlement Point,Bahamas (the)
3916581,2024-09-30 22:59:59.971200,7550,3894.0,25.732000,279.838013,"Virginia Key, FL",United States of America (the)
3916582,2024-09-30 22:59:59.971200,7620,2965.0,30.403000,272.786987,"Pensacola, FL",United States of America (the)


We can now manipulate this DataFrame using DataFrame methods, export the results as a CSV file, and so on.
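
As a minimal sketch of that workflow, the example below builds a tiny made-up Dataset with the same layout as the gauge data (`toy_xr` and its values are hypothetical), converts it to a DataFrame, drops the rows without a sea level reading, and renders the result as CSV text:

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical toy Dataset mimicking the (time, record_id) layout
toy_xr = xr.Dataset(
    {
        "sea_level": (
            ("time", "record_id"),
            np.array([[np.nan, 1128.0], [1640.0, 1893.0]]),
        )
    },
    coords={
        "time": pd.to_datetime(["2024-09-30 21:00", "2024-09-30 22:00"]),
        "record_id": [2570, 2420],
    },
)

# Coordinates become index levels, then regular columns
toy_pd = toy_xr.to_dataframe().reset_index()

# Drop rows where no sea level was recorded
toy_clean = toy_pd.dropna(subset=["sea_level"])

# With no path given, to_csv() returns the CSV content as a string
csv_text = toy_clean.to_csv(index=False)
print(csv_text)
```

Passing a file name to `.to_csv()` instead (for example a hypothetical `"output/toy_gauges.csv"`) would write the table to disk.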

## Combine multiple xarray Datasets

As in the case of CSV files, sometimes you can only download netCDF files for a subset of the data you want, and before analysis you'll need to combine multiple xarray Datasets into one. As an example, the 3 .nc files below contain sea surface temperature from [OISST](https://www.ncei.noaa.gov/products/optimum-interpolation-sst) in 2025 on [Jan 15](https://github.com/OCEAN-215-2025/preclass/tree/main/week_07/data/oisst-20250115.nc), [Feb 15](https://github.com/OCEAN-215-2025/preclass/tree/main/week_07/data/oisst-20250115.nc), and [Mar 15](https://github.com/OCEAN-215-2025/preclass/tree/main/week_07/data/oisst-20250115.nc).

In [10]:
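# Use the same lowercase "data/" folder as the earlier cells
# (paths are case-sensitive on many systems)
oisst_20250115 = xr.open_dataset("data/oisst-20250115.nc")
oisst_20250215 = xr.open_dataset("data/oisst-20250215.nc")
oisst_20250315 = xr.open_dataset("data/oisst-20250315.nc")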

In [11]:
display(oisst_20250115)

We can combine these xarray Datasets into one using `xr.concat()`, passing `dim="time"` to tell xarray that we want to concatenate along the time dimension:

In [12]:
oisst_all = xr.concat([oisst_20250115, oisst_20250215, oisst_20250315], dim="time")
display(oisst_all)
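
To see what `xr.concat()` does without downloading any files, here is a minimal sketch with made-up data. The helper `make_snapshot()` and its values are hypothetical stand-ins for the daily files. Each snapshot already carries a `time` dimension of length 1, so concatenating three of them gives a `time` dimension of length 3:

```python
import numpy as np
import pandas as pd
import xarray as xr

def make_snapshot(date, value):
    """Hypothetical stand-in for one daily file: a 1 x 2 x 2 grid."""
    return xr.Dataset(
        {"sst": (("time", "lat", "lon"), np.full((1, 2, 2), value))},
        coords={
            "time": [pd.Timestamp(date)],
            "lat": [0.0, 1.0],
            "lon": [10.0, 11.0],
        },
    )

snapshots = [
    make_snapshot("2025-01-15", 20.0),
    make_snapshot("2025-02-15", 21.0),
    make_snapshot("2025-03-15", 22.0),
]

# Stack the snapshots end to end along the existing time dimension
combined = xr.concat(snapshots, dim="time")
print(dict(combined.sizes))
```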

Occasionally you'll need to create an extra dimension so that your individual Datasets can be combined along this new dimension. To do so, we can use the `.expand_dims()` method of both Dataset and DataArray (which creates the dimension), followed by the `.assign_coords()` method (which assigns coordinates to the new dimension).

Taking `oisst_20250115` as an example, suppose we want to create a new dimension called `month`. We may do:

In [13]:
oisst_20250115.expand_dims(dim={"month": 1}).assign_coords({"month": [1]})
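
Once every Dataset has the new dimension, they can be joined along it with `xr.concat()`. The sketch below applies the same `.expand_dims()` and `.assign_coords()` pattern to made-up fields (the helper `make_field()` and its values are hypothetical) that start out with only `lat` and `lon` dimensions:

```python
import numpy as np
import xarray as xr

def make_field(value):
    """Hypothetical 2 x 2 (lat, lon) field with no month dimension."""
    return xr.Dataset(
        {"sst": (("lat", "lon"), np.full((2, 2), value))},
        coords={"lat": [0.0, 1.0], "lon": [10.0, 11.0]},
    )

# Month number -> field for that month
fields = {1: make_field(20.0), 2: make_field(21.0), 3: make_field(22.0)}

# Give each field a length-1 month dimension labelled with its month
with_month = [
    ds.expand_dims(dim={"month": 1}).assign_coords({"month": [m]})
    for m, ds in fields.items()
]

# Concatenate along the newly created dimension
by_month = xr.concat(with_month, dim="month")
print(dict(by_month.sizes))
```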

## Export xarray Datasets as netCDF file

As in the case of pandas DataFrames, sometimes we want to save the Dataset obtained after some manipulation to a new netCDF file. We can do so using the `.to_netcdf()` method of Dataset and DataArray. Importantly, you may want to make sure that each data variable is compressed so that your file does not become excessively large. To specify compression, supply a nested dictionary to the `encoding` argument, where each key is the name of a data variable and each value is itself a dictionary of compression options. As an example, to save the `oisst_all` Dataset we created:

In [14]:
# NOTE: the output folder has to already exist
oisst_all.to_netcdf("output/oisst_all.nc", encoding = {
    "sst": {"zlib": True, "complevel": 9},
    "anom": {"zlib": True, "complevel": 9},
    "err": {"zlib": True, "complevel": 9},
    "ice": {"zlib": True, "complevel": 9}
})
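
When a Dataset has many data variables, writing out the nested dictionary by hand gets tedious. One possible shortcut, sketched here on a made-up Dataset (`toy` and its variables are hypothetical), is to build the dictionary by looping over `.data_vars`:

```python
import numpy as np
import xarray as xr

# Hypothetical toy Dataset with two data variables
toy = xr.Dataset(
    {
        "sst": (("lat", "lon"), np.zeros((2, 2))),
        "ice": (("lat", "lon"), np.ones((2, 2))),
    },
    coords={"lat": [0.0, 1.0], "lon": [10.0, 11.0]},
)

# Same compression options as above, applied to every data variable
compression = {"zlib": True, "complevel": 9}
encoding = {name: dict(compression) for name in toy.data_vars}
print(encoding)
```

The resulting `encoding` dictionary can then be passed to `.to_netcdf()` in the same way as the hand-written one, assuming a netCDF backend that supports `zlib` compression (such as `netCDF4`) is installed.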